Development of Criterion Measures of Air Traffic Controller Performance

Walter  C.  Borman,  Jerry  W.  Hedge,  Mary  Ann  Hanson,  Kenneth  T.  Bruskiewicz 
Personnel  Decisions  Research  Institutes,  Inc. 

Henry  Mogilka  and  Carol  Manning 
Federal  Aviation  Administration 
Laura  B.  Bunch  and  Kristen  E.  Horgen 
University  of  South  Florida  and 
Personnel  Decisions  Research  Institutes,  Inc. 


INTRODUCTION 

An important element of the AT-SAT predictor development
and validation project is criterion performance
measurement. To obtain an accurate picture of
the  experimental  predictor  tests'  validity  for  predicting 
controller  performance,  it  is  important  to  have  reliable 
and  valid  measures  of  controller  job  performance.  That 
is,  a  concurrent  validation  study  involves  correlating 
predictor  scores  for  controllers  in  the  validation  sample 
with  criterion  performance  scores.  If  these  performance 
scores  are  not  reliable  and  valid,  our  inferences  about 
predictor  test  validities  are  likely  to  be  incorrect. 
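In its simplest form, the validity coefficient described above is a Pearson correlation between predictor and criterion scores, and under classical test theory an unreliable criterion lowers the correlation that can be observed. The Python sketch below is illustrative only: the function names `pearson_r` and `attenuated_r` and all score values are hypothetical and are not AT-SAT data or the project's analysis code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary within each list")
    return sxy / sqrt(sxx * syy)


def attenuated_r(true_r, predictor_reliability, criterion_reliability):
    """Classical attenuation formula: the correlation expected between
    imperfectly reliable measures, given the correlation of true scores."""
    return true_r * sqrt(predictor_reliability * criterion_reliability)


# Hypothetical scores for five controllers (not real data).
predictor = [52, 61, 47, 70, 58]
criterion = [3.9, 4.6, 3.5, 5.8, 4.1]
validity = pearson_r(predictor, criterion)
```

For example, `attenuated_r(0.5, 1.0, 0.64)` gives an expected observed correlation of about .40, which shows why criterion reliability matters for the validity inferences discussed here.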

The  job  of  air  traffic  controller  is  very  complex  and 
potentially difficult to capture in a criterion development
effort. Yet, the goal here was to develop criterion
measures  that  would  provide  a  comprehensive  picture  of 
controller  job  performance. 

Initial job analysis work suggested a model of performance
that included both maximum and typical performance
(Bobko, Nickels, Blair, & Tartak, 1994; Nickels,
Bobko,  Blair,  Sands,  &  Tartak,  1995).  More  so  than 
with  many  jobs,  maximum  “can-do”  performance  is  very 
important  in  controlling  air  traffic.  There  are  times  on 
this job when the most important consideration is maximum
performance: does the controller have the technical
skill to keep aircraft separated under very difficult
conditions?  Nonetheless,  typical  performance  over  time 
is  also  important  for  this  job. 

Based  on  a  task-based  job  analysis  (Nickels  et  al., 
1995),  a  critical  incidents  study  (Hedge,  Borman, 
Hanson, Carter, & Nelson, 1993), and past research on
controller performance (e.g., Buckley, O'Connor, &
Beebe,  1969;  Cobb,  1967),  we  began  to  formulate  ideas 
for  the  criterion  measures.  Hedge  et  al.  (1993)  discuss 
literature  that  was  reviewed  in  formulating  this  plan,  and 
summarize an earlier version of the criterion plan. Basically,
this plan was to develop multiple measures of
controller  performance.  Each  of  these  measures  has 
strengths  for  measuring  performance,  as  well  as  certain 
limitations.  However,  taken  together,  we  believe  the 
measures  will  provide  a  valid  depiction  of  each  controller's 
job  performance.  The  plan  involved  developing  a  special 
situational  judgment  test  (called  the  Computer-Based 
Performance Measure, or CBPM) to represent the maximum
performance/technical proficiency part of the job
and behavior-based rating scales to reflect typical performance.
A high-fidelity air traffic control test (the High
Fidelity  Performance  Measure,  HFPM)  was  also  to  be 
developed  to  investigate  the  construct  validity  of  the 
lower  fidelity  CBPM  with  a  subset  of  the  controllers  who 
were  administered  the  HFPM. 

The Computer-Based Performance Measure (CBPM)

The  goal  in  developing  the  CBPM  was  to  provide  a 
relatively  practical,  economical  measure  of  technical 
proficiency  that  could  be  administered  to  the  entire 
concurrent validation sample. Practical constraints limited
the administration of the higher fidelity measure
(HFPM)  to  a  subset  of  the  validation  sample. 

Previous  research  conducted  by  Buckley  and  Beebe 
(1972) suggested that scores on a lower fidelity simulation
are likely to correlate with scores on a real-time,
hands-on simulation and also with performance ratings
provided by peers and supervisors. Their motion picture,
or “CODE,” test presented controllers with a motion
picture of a radar screen and asked them to note when
there were potential conflictions. Buckley and Beebe
reported significant correlations between CODE scores
and for-research-only ratings provided by the controllers’
peers, but the sample size in this research was only
19. Buckley, O’Connor, and Beebe (1969) also reported
that correlations between CODE scores and scores on a
higher-fidelity simulation were substantial (the highest
correlation was .73), but, again, the sample size was very
small. Finally, Milne and Colmen (1972) found a
substantial correlation between the CODE test and
for-research-only job performance ratings. In general, results
for the CODE test suggest that a lower-fidelity
simulation  can  capture  important  air  traffic  controller 
judgment  and  decision-making  skills. 

Again,  the  intention  in  the  present  effort  was  to 
develop  a  computerized  performance  test  that  as  closely 
as  possible  assessed  the  critical  technical  proficiency, 
separating-aircraft  part  of  the  controller  job.  Thus,  the 
target  performance  constructs  included  judgment  and 
decision making in handling air traffic scenarios, procedural
knowledge about how to do technical tasks, and
“confliction  prediction”;  i.e.,  the  ability  to  know  when 
a  confliction  is  likely  to  occur  sometime  in  the  near 
future  if  nothing  is  done  to  address  the  situation. 

The CBPM was patterned after the situational judgment
test method. The basic idea was to have an air
traffic  scenario  appear  on  the  computer  screen,  allow  a 
little  time  for  the  problem  to  evolve,  and  then  freeze  the 
screen  and  ask  the  examinee  a  multiple  choice  question 
about  how  to  respond  to  the  problem.  To  develop  this 
test,  we  trained  three  experienced  controllers  on  the 
situational  judgment  test  method  and  elicited  initial 
ideas  about  applying  the  method  to  the  air  traffic 
context. 

The  first  issue  in  developing  this  test  was  the  airspace 
in  which  the  test  would  be  staged.  There  is  a  great  deal 
of  controller  job  knowledge  that  is  unique  to  controlling 
traffic in a specific airspace (e.g., the map, local obstructions).
Each controller is trained and certified on the
sectors  of  airspace  where  he  or  she  works.  Our  goal  in 
designing  the  CBPM  airspace  was  to  include  a  set  of 
airspace  features  (e.g.,  flight  paths,  airports,  special  use 
airspace) sufficiently complicated to allow for development
of difficult, realistic situations or problems, but to
also keep the airspace relatively simple because it is
important that controllers who take the CBPM can
learn these features very quickly. Figure 4.1 shows the
map  of  the  CBPM  airspace,  and  Figure  4.2  is  a  summary 
of  important  features  of  this  airspace  that  do  not  appear 
on  the  map. 

After  the  airspace  was  designed,  the  three  air  traffic 
controller  subject  matter  experts  (SMEs)  were  provided 
with detailed instructions concerning the types of scenarios
and questions appropriate for this type of test.
These  SMEs  then  developed  several  air  traffic  scenarios 
on  paper  and  multiple  choice  items  for  each  scenario. 
The  plan  was  to  generate  many  more  items  than  were 
needed  on  the  final  test,  and  then  select  a  subset  of  the 
best  items  later  in  the  test  development  process.  Also, 
based  on  the  job  analysis  (Nickels  et  al.,  1995)  a  list  of 
the 40 most critical en route controller tasks was available,
and one primary goal in item development was to
measure  performance  in  as  many  of  these  tasks  as 
possible,  especially  those  that  were  rated  most  critical. 

At  this  stage,  each  scenario  included  a  map  depicting 
the  position  of  each  aircraft  at  the  beginning  of  the 
scenario, flight strips that provided detailed information
about each aircraft (e.g., the intended route of
flight),  a  status  information  area  (describing  weather 
and  other  pertinent  background  information),  and  a 
script  describing  how  the  scenario  would  unfold.  This 
script included the timing and content of voice communications
from pilots and/or controllers, radar screen
updates (which occur every 10 seconds in the en route
environment), other events (e.g., hand-offs, the appearance
of unidentified radar targets, emergencies), and the
exact  timing  and  wording  of  each  multiple  choice 
question  (along  with  possible  responses). 

After  the  controllers  had  independently  generated  a 
large  number  of  scenarios  and  items,  we  conducted 
discussion  sessions  in  which  each  SME  presented  his 
scenarios  and  items,  and  then  the  SMEs  and  researchers 
discussed and evaluated these items. Discussion included
topics such as whether all necessary information
was  included,  whether  the  distractors  were  plausible, 
whether  or  not  there  were  “correct”  or  at  least  better 
responses, whether the item was too tricky (i.e., choosing
the most effective response did not reflect an important
skill), or too easy (i.e., the correct response was
obvious),  and  whether  the  item  was  fair  for  all  facilities 
(e.g.,  might  the  item  be  answered  differently  at  different 
facilities  because  of  different  policies  or  procedures?).  As 
mentioned  previously,  the  CBPM  was  patterned  after 
the  situational  judgment  test  approach.  Unlike  other 
multiple  choice  tests,  there  was  not  necessarily  only  one 
correct  answer,  with  all  the  others  being  wrong.  Some 
items had, for example, one best answer and one or two
others  that  represented  fairly  effective  responses.  These 
test  development  sessions  resulted  in  30  scenarios  and 
99  items,  with  between  2  and  6  items  per  scenario. 

An  initial  version  of  the  test  was  then  programmed  to 
run  on  a  standard  personal  computer  with  a  17-inch 
high-resolution  monitor.  This  large  monitor  was  needed 
to  realistically  depict  the  display  as  it  would  appear  on 
an  en  route  radar  screen.  The  scenarios  were  initially 
programmed using a “radar engine,” which had previously
been developed for the FAA for training purposes.
This program was designed to realistically display airspace
features and the movement of aircraft. After the
scenarios  were  programmed  into  the  radar  engine,  the 
SMEs watched the scenarios evolve and made modifications
as necessary to meet the measurement goals. Once
realistic  positioning  and  movement  of  the  aircraft  had 
been  achieved,  the  test  itself  was  programmed  using 
Authorware.  This  program  presented  the  radar  screens, 
voice  communications,  and  multiple  choice  questions, 
and it also collected the multiple choice responses.

Thus,  the  CBPM  is  essentially  self-administering 
and  runs  off  a  CD-ROM.  The  flight  strips  and  status 
information  areas  are  compiled  into  a  booklet,  with  one 
page  per  scenario,  and  the  airspace  summary  and  sector 
map  (see  Figures  4.1  and  4.2)  are  displayed  near  the 
computer  when  the  test  is  administered.  During  test 
administration,  controllers  are  given  60  seconds  to 
review  each  scenario  before  it  begins.  During  this  time, 
the  frozen  radar  display  appears  on  the  screen,  and 
examinees  are  allowed  to  review  the  flight  strips  and  any 
other information they believe is relevant to that particular
scenario (e.g., the map or airspace summary).
Once a test item has been presented, examinees are given
25 seconds to answer the question. This is analogous to
the  controller  job,  where  they  are  expected  to  “get  the 
picture”  concerning  what  is  going  on  in  their  sector  of 
airspace,  and  then  are  sometimes  required  to  react 
quickly  to  evolving  situations.  We  also  prepared  a 
training module to familiarize examinees with the airspace
and instructions concerning how to take the test.

After  preparing  these  materials,  we  gathered  a  panel 
of  four  experienced  controllers  who  were  teaching  at  the 
FAA  Academy  and  another  panel  of  five  experienced 
controllers  from  the  field  to  review  the  scenarios  and 
items.  Specifically,  each  of  these  groups  was  briefed 
regarding  the  project,  trained  on  the  airspace,  and  then 
shown  each  of  the  scenarios  and  items.  Their  task  was  to 
rate  the  effectiveness  level  of  each  response  option. 
Ratings were made independently on a 1-7 scale. Table
4.1 describes the controllers who participated in this
initial scaling workshop, and Table 4.2 summarizes the
interrater agreement (intraclass correlations) across items
for the two groups. After this initial rating session with
each  of  the  groups,  the  panel  members  compared  their 
independent  ratings  and  discussed  discrepancies.  In 
general,  two  different  outcomes  occurred  as  a  result  of 
these  discussions.  In  some  cases,  one  or  two  SMEs  failed 
to notice or misinterpreted part of the item (e.g., did not
examine  an  important  flight  strip).  For  these  cases,  no 
changes  were  generally  made  to  the  item.  In  other  cases, 
there was a legitimate disagreement about the effectiveness
of one or more response options. Here, we typically
discussed revisions to the item or the scenario itself that
would lead to agreement between panel members (without
making the item overly transparent). In addition,
discussions  with  the  first  group  indicated  that  several 
items were too easy (i.e., the answer was obvious). These
items  were  revised  to  be  less  obvious.  Five  items  were 
dropped  because  they  could  not  be  satisfactorily  revised. 
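The report does not state which form of the intraclass correlation was used for the interrater agreement figures in Table 4.2. As one possibility only, the Python sketch below computes a one-way random-effects ICC for a table in which each row is a rated target (for example, one response option) and each column is one SME's rating. The function name `icc_oneway` and the ratings shown are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for an n-targets x k-raters table.

    Returns a tuple: (reliability of a single rater,
                      reliability of the mean of the k raters).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with >= 2 rows and columns")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Mean square between targets and mean square within targets.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("targets do not differ; ICC is undefined")

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical 1-7 effectiveness ratings: 4 response options x 4 SMEs.
option_ratings = [
    [6, 7, 6, 6],
    [3, 4, 3, 2],
    [1, 2, 1, 1],
    [5, 5, 4, 6],
]
single_icc, mean_icc = icc_oneway(option_ratings)
```

Other ICC forms (for example, two-way models that treat raters as a separate source of variance) would give different values from the same ratings, so this sketch should not be read as a reconstruction of the reported coefficients.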

These  ratings  and  subsequent  discussions  resulted  in 
substantial  revisions  to  the  CBPM.  The  revisions  were 
accomplished  in  preparation  for  a  final  review  of  the 
CBPM  by  a  panel  of  expert  SMEs.  For  this  final  review 
session,  12  controllers  from  the  field  were  identified 
who  had  extensive  experience  as  controllers  and  had 
spent  time  as  either  trainers  or  supervisors.  Characteristics 
of this final scaling panel group are shown in Table 4.1.

The  final  panel  was  also  briefed  on  the  project  and  the 
CBPM  and  then  reviewed  each  item.  To  ensure  that 
they  used  all  of  the  important  information  in  making 
their  ratings,  short  briefings  were  prepared  for  each 
item, highlighting the most important pieces of information
that affected the effectiveness of the various
responses. Each member of the panel then independently
rated the effectiveness level of each response
option.  This  group  did  not  review  each  other’s  ratings 
or  discuss  the  items. 

Interrater  agreement  data  appear  in  Table  4.2.  These 
results  show  great  improvement  because  in  the  final 
scaling  of  the  CBPM,  80  of  the  94  items  have  interrater 
reliability.  As  a  result  of  the  review,  5  items  were 
dropped  because  there  was  considerable  disagreement 
among  raters.  These  final  scaling  data  were  used  to  score 
the  CBPM.  For  each  item,  examinees  were  assigned  the 
mean  effectiveness  of  the  response  option  they  chose, 
with  a  few  exceptions.  First,  for  the  knowledge  items, 
there  was  only  one  correct  response.  Similarly,  for  the 
“confliction  prediction”  items,  there  was  one  correct 
response.  In  addition,  it  is  more  effective  to  predict  a 
confliction when there is not one (i.e., be conservative)
than  to  fail  to  predict  a  confliction  when  there  is  one. 
Thus,  a  higher  score  was  assigned  for  an  incorrect 
conservative  response  than  an  incorrect  response  that 
predicted no confliction when one would have occurred.
The controller SMEs generated rational keys for
23  knowledge  and  confliction  prediction  type  items. 
Figure  4.3  shows  an  example  of  a  CBPM  item.  One  final 
revision  of  the  CBPM  was  made  based  on  pilot  test  data. 
The  pilot  test  will  be  discussed  in  a  later  section. 
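The scoring rules just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's scoring program: the data layout, the function names `option_scores` and `score_cbpm`, and in particular the point values in the rational keys (7, 4, and 1) are hypothetical, since the report does not give the keyed values.

```python
from statistics import mean


def option_scores(item):
    """Return a dict mapping each response option to its score."""
    if item["kind"] == "judgment":
        # Judgment items: mean SME effectiveness rating (1-7) per option.
        return {opt: mean(r) for opt, r in item["sme_ratings"].items()}
    # Knowledge and confliction prediction items: SME-written rational key.
    return dict(item["rational_key"])


def score_cbpm(items, responses):
    """Sum the scores of the chosen options.

    Unanswered items (None) contribute nothing; that handling is an
    assumption of this sketch.
    """
    total = 0
    for item, choice in zip(items, responses):
        if choice is not None:
            total += option_scores(item)[choice]
    return total


# Hypothetical items (all values are illustrative).
judgment_item = {
    "kind": "judgment",
    "sme_ratings": {"A": [6, 7, 5], "B": [4, 5, 3], "C": [2, 1, 3]},
}
# No confliction would occur: wrongly predicting one is a conservative
# error and keeps partial credit.
no_confliction_item = {
    "kind": "confliction",
    "rational_key": {"no_confliction": 7, "confliction": 4},
}
# A confliction would occur: failing to predict it earns the lowest score.
confliction_item = {
    "kind": "confliction",
    "rational_key": {"confliction": 7, "no_confliction": 1},
}

items = [judgment_item, no_confliction_item, confliction_item]
total = score_cbpm(items, ["B", "confliction", "no_confliction"])
```

Keeping the key as an explicit option-to-score table lets the asymmetry between a conservative error and a missed confliction be expressed per item rather than hard-coded in the scoring logic.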

The  Behavior  Summary  Scales 

The  intention  here  was  to  develop  behavior-based 
rating scales that would encourage raters to make evaluations
as objectively as possible. An approach to accomplish
this is to prepare scales with behavioral statements
anchoring different effectiveness levels on each dimension
so that the rating task is to compare observed ratee
behavior  with  behavior  on  the  scale.  This  matching 
process  should  be  more  objective  than,  for  example, 
using  a  1  =  very  ineffective  to  7  =  very  effective  scale.  A 
second  part  of  this  approach  is  to  orient  and  train  raters 
to use the behavioral statements in the manner intended.

The  first  step  in  scale  development  was  to  conduct 
workshops  to  gather  examples  of  effective,  mid-range, 
and ineffective controller performance. Four such workshops
proceeded with controllers teaching at the FAA
Academy and with controllers at the Minneapolis Center.
A total of 73 controllers participated in the workshops;
they generated 708 performance examples.

We  then  analyzed  these  performance  examples  and 
tentatively identified eight relevant performance categories:
(1) Teamwork, (2) Coordinating, (3) Communicating,
(4) Monitoring, (5) Planning/Prioritizing, (6)
Separation, (7) Sequencing/Preventing Delays, and (8)
Reacting to Emergencies. Preliminary definitions were
developed for these categories. A series of five “mini-workshops”
was subsequently held with controllers to
review  the  categories  and  definitions.  This  iterative 
process,  involving  24  controllers,  refined  our  set  of 
performance  categories  and  definitions.  The  end  result 
was  a  set  of  ten  performance  categories.  These  final 
categories  and  their  definitions  are  shown  in  Table  4.3. 

Interestingly,  scale  development  work  to  this  point 
resulted  in  the  conclusion  that  these  ten  dimensions 
were  relevant  for  all  three  controller  options:  tower  cab, 
TRACON,  and  en  route.  However,  subsequent  work 
with tower cab controllers resulted in scales with somewhat
different behavioral content. Because AT-SAT
focused  on  en  route  controllers,  we  limit  our  discussion 
to  scale  development  for  that  group. 

The  next  step  was  to  “retranslate”  the  performance 
examples.  This  required  controller  SMEs  to  make  two 
judgments  for  each  example.  First,  they  assigned  each 
performance example to one (and only one) performance
category. Second, the controllers rated the level
of  effectiveness  (from  1  =  very  ineffective  to  7  =  very 
effective)  of  each  performance  example. 

Thus,  we  assembled  the  ten  performance  categories 
and 708 performance examples into four separate booklets
that were used to collect the SME judgments just
discussed.  In  all,  booklets  were  administered  to  47  en 
route  controllers  at  three  sites  within  the  continental 
United  States.  Because  each  booklet  required  2-3  hours 
to  complete,  each  of  the  SMEs  was  asked  to  complete 
only  one  booklet.  As  a  result,  each  performance  example 
or  “item”  was  evaluated  by  9  to  20  controllers. 

Results of the retranslation showed that 261 examples
were relevant to the en route option, were sorted
into a single dimension more than 60% of the time, and
had standard deviations of less than 1.50 for the effectiveness
ratings. These examples were judged as providing
unambiguous behavioral performance information
with  respect  to  both  dimension  and  effectiveness  level. 
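The retention rule for retranslated examples can be illustrated with a short Python sketch. The thresholds (more than 60% agreement on one dimension, standard deviation below 1.50) come from the text; the function name `retranslation_summary`, the data layout, the sample judgments, and the use of the sample (n - 1) standard deviation are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def retranslation_summary(judgments, relevant=True,
                          min_agreement=0.60, max_sd=1.50):
    """Summarize SME judgments for one performance example.

    judgments: list of (dimension, effectiveness rating on 1-7) pairs,
               one pair per SME who reviewed the example.
    relevant:  whether the example applies to the en route option.
    """
    if len(judgments) < 2:
        raise ValueError("need judgments from at least two SMEs")

    counts = Counter(dim for dim, _ in judgments)
    modal_dim, modal_n = counts.most_common(1)[0]
    agreement = modal_n / len(judgments)

    ratings = [rating for _, rating in judgments]
    sd = stdev(ratings)  # sample standard deviation (assumption)

    return {
        "dimension": modal_dim,
        "agreement": agreement,
        "mean_effectiveness": mean(ratings),
        "sd": sd,
        "retain": relevant and agreement > min_agreement and sd < max_sd,
    }


# Hypothetical judgments from ten SMEs for one example.
clear_example = [("Separation", r) for r in (6, 6, 7, 6, 5, 6, 7, 6)]
clear_example += [("Monitoring", 6), ("Monitoring", 5)]

# Hypothetical example on which SMEs disagree about the dimension.
ambiguous_example = [
    ("Coordinating", 2), ("Communicating", 6), ("Coordinating", 3),
    ("Communicating", 5), ("Teamwork", 4),
]

clear = retranslation_summary(clear_example)
ambiguous = retranslation_summary(ambiguous_example)
```

In this sketch the first example is retained (80% of SMEs chose one dimension and the ratings are tightly clustered), while the second is not, because no dimension reaches the 60% threshold.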

Then for each of the ten dimensions, the performance
examples belonging to that dimension were
further divided into high effectiveness (retranslated at 5
to 7), middle effectiveness (3 to 5), and low effectiveness
(1 to 3). Behavior summary statements were written to
summarize all of the behavioral information reflected in
the individual examples. In particular, two or occasionally
three behavior statements for each dimension and
effectiveness  level  (i.e.,  high,  medium,  or  low)  were 
generated  from  the  examples.  Additional  rationale  for 
this  behavior  summary  scale  method  can  be  found  in 
Borman  (1979). 

As  a  final  check  on  the  behavior  summary  statements, 
we  conducted  a  retranslation  of  the  statements  using  the 
same procedure as was used with the individual examples.
Seventeen en route controllers sorted each of the
87 statements into one of the dimensions and rated the
effectiveness level reflected on a 1-7 scale. Results of this
retranslation can be found in Pulakos, Keichel,
Plamondon, Hanson, Hedge, and Borman (1996). Finally,
for those statements either sorted into the wrong
dimensions by 40% or more of the controllers or retranslated
at an overly high or low effectiveness level, we
made revisions based on our analysis of the likely reason
for the retranslation problem. The final behavior summary
scales appear in Appendix C.

Regarding the rater orientation and training program,
our experience and previous research have shown
that the quality of performance ratings can be improved
with appropriate rater training (e.g., Pulakos, 1984,
1986; Pulakos & Borman, 1986). Over the past several
years,  we  have  been  refining  a  training  strategy  that  (1) 
orients  raters  to  the  rating  task  and  why  the  project 
requires  accurate  evaluations;  (2)  familiarizes  raters  with 
the  rating  dimensions  and  how  each  is  defined;  (3) 
teaches  raters  how  to  most  effectively  use  the  behavior 
summary  statements  to  make  objective  ratings;  (4) 
describes  certain  rater  errors  (e.g.,  halo)  in  simple, 
common-sense  terms  and  asks  raters  to  avoid  them;  and 
finally  (5)  encourages  raters  to  be  as  accurate  as  possible 
in  their  evaluations. 

For  this  application,  we  revised  the  orientation  and 
training  program  to  encourage  accurate  ratings  in  this 
setting.  In  particular,  a  script  was  prepared  to  be  used  by 
persons  administering  the  rating  scales  in  the  field. 
Appendix  D  contains  the  script.  In  addition,  a  plan  for 
gathering  rating  data  was  created.  Discussions  with 
controllers  in  the  workshops  described  earlier  suggested 
that  both  supervisors  and  peers  (i.e.,  fellow  controllers) 
would  be  appropriate  rating  sources.  Because  gathering 
ratings  from  relatively  large  numbers  of  raters  per  ratee 
is  advantageous  to  increase  levels  of  interrater  reliability, 
we  requested  that  two  supervisor  and  two  peer  raters  be 
asked  to  contribute  ratings  for  each  controller  ratee  in 
the  study.  Supervisor  and  peer  raters  were  identified 
who  had  worked  in  the  same  area  as  a  controller  for  at 
least  6  months  and  were  very  familiar  with  their  job 
performance.  For  practical  reasons  we  set  a  limit  of  5-6 
controllers  to  be  rated  by  any  individual  rater  in  the 
research.  The  rater  orientation  and  training  program 
and  the  plan  for  administering  the  ratings  in  the  field 
were incorporated into a training module for those professionals
selected to conduct the data collection. That training
session is described in a subsequent section.
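The rationale for requesting several raters per ratee can be illustrated with the Spearman-Brown formula, which estimates the reliability of the mean of k ratings from the reliability of a single rating. The Python sketch below is illustrative only: the function name `spearman_brown` and the single-rater reliability of .50 are hypothetical, and the formula assumes that raters are interchangeable, which may not hold when supervisor and peer ratings are pooled.

```python
def spearman_brown(single_rater_reliability, k):
    """Estimated reliability of the mean of k parallel ratings."""
    r = single_rater_reliability
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r / (1 + (k - 1) * r)


# Hypothetical single-rater reliability, compared across panel sizes:
# one rater, two raters, and the four raters (two supervisors plus
# two peers) requested per ratee.
assumed_r = 0.50
by_panel_size = {k: spearman_brown(assumed_r, k) for k in (1, 2, 4)}
```

Under that assumed value, the estimated reliability rises from .50 for one rater to about .67 for two raters and .80 for four, which is the kind of gain the rating plan was designed to capture.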

The  High-Fidelity  Performance  Measure  (HFPM) 

Measuring the job performance of air traffic controllers
is a unique situation where reliance on a work
sample  methodology  may  be  especially  applicable.  Use 
of  a  computer-generated  simulation  can  create  an  ATC 
environment  that  allows  the  controller  to  perform  in  a 
realistic  setting.  Such  a  simulation  approach  allows  the 
researcher to provide high levels of stimulus and response
fidelity (Tucker, 1984). Simulator studies of
ATC problems have been reported in the literature since
the  1950s.  Most  of  the  early  research  was  directed 
toward  the  evaluation  of  effects  of  workload  variables 
and changes in control procedures on overall system performance,
rather than focused on individual performance
assessment  (Boone,  Van  Buskirk,  and  Steen,  1980). 

However, there have been some research and development
efforts aimed at capturing the performance of
air  traffic  controllers,  including  Buckley,  O’Connor, 
Beebe,  Adams,  and  MacDonald  (1969),  Buckley, 
DeBaryshe,  Hitchner,  and  Kohn  (1983),  and 
Sollenberger, Stein, and Gromelski (1997). For example,
in the Buckley et al. (1983) study, trained
observers’ ratings of simulator performance were found
to be highly related to various aircraft safety and expeditiousness
measures. Full-scale dynamic simulation allows the
controller  to  direct  the  activities  of  a  sample  of  simulated 
air  traffic,  performing  characteristic  functions  such  as 
ordering  changes  in  aircraft  speed  or  flight  path,  but  within 
a  relatively  standardized  work  sample  framework. 

The  intention  of  the  HFPM  was  to  provide  an 
environment  that  would,  as  nearly  as  possible,  simulate 
actual  conditions  existing  in  the  controller’s  job.  One 
possibility considered was to test each controller working
in his or her own facility’s airspace. This approach
was eventually rejected, however, because of the problem
of unequal difficulty levels across facilities and even
across sectors within a facility (Borman, Hedge, &
Hanson, 1992; Hanson, Hedge, Borman, & Nelson,
1993; Hedge, Borman, Hanson, Carter, & Nelson,
1993). Comparing the performance of controllers working
in environments with unequal (and even unknown)
difficulty  levels  is  extremely  problematic.  Therefore,  we 
envisioned  that  performance  could  be  assessed  using  a 
“simulated”  air  traffic  environment.  This  approach  was 
feasible  because  of  the  availability  at  the  FAA  Academy 
of  several  training  laboratories  equipped  with  radar 
stations  similar  to  those  found  in  the  field.  In  addition, 
they  use  a  generic  airspace  (Aero  Center)  designed  to 
allow  presentation  of  typical  air  traffic  scenarios  that 
must  be  controlled  by  the  trainee  (or  in  our  case,  the 
ratee). Use of a generic airspace also allowed for standardization
of assessment. See Figure 4.4 for a visual
depiction  of  the  Aero  Center  airspace. 

Thus,  through  use  of  the  Academy’s  radar  training 
facility (RTF) equipment, in conjunction with the Aero
Center generic airspace, we were able to provide a test
environment affording the potential for both high stimulus
and response fidelity. Our developmental efforts
focused, then, on: (1) designing and programming
specific  scenarios  in  which  the  controllers  would  control 
air  traffic;  and  (2)  developing  measurement  tools  for 
evaluating  controller  performance. 

Scenario  Development 

The  air  traffic  scenarios  were  designed  to  incorporate 
performance  constructs  central  to  the  controller’s  job, 
such  as  maintaining  aircraft  separation,  coordinating, 
communicating,  and  maintaining  situation  awareness. 
Also,  attention  was  paid  to  representing  in  the  scenarios 
the most important tasks from the task-based job analysis.
Finally, it was decided that, to obtain variability in
controller  performance,  scenarios  should  be  developed 
with  either  moderate  or  quite  busy  traffic  conditions. 
Thus,  to  develop  our  HFPM  scenarios,  we  started  with 
a  number  of  pre-existing  Aero  Center  training  scenarios, 
and revised and reprogrammed them to the extent necessary
to  include  relevant  tasks  and  performance  requirements 
with  moderate- to  high-intensity  traffic  scenarios.  In  all, 
16  scenarios  were  developed,  each  designed  to  run  no 
more  than  60  minutes,  inclusive  of  start-up,  position  relief 
briefing,  active  air  traffic  control,  debrief,  and  performance 
evaluation. Consequently, active manipulation of air traffic
was limited to approximately 30 minutes.

The  development  of  a  research  design  that  would 
allow  sufficient  time  for  both  training  and  evaluation 
was  critical  to  the  development  of  scenarios  and  accurate 
evaluation of controller performance. Sufficient training
time was necessary to ensure adequate familiarity
with the airspace, thereby eliminating differential knowledge
of the airspace as a contributing factor to controller
performance. Adequate testing time was important to
ensure sufficient opportunity to capture controller performance
and allow for stability of evaluation. A final
consideration,  of  course,  was  the  need  for  controllers  in 
our  sample  to  travel  to  Oklahoma  City  to  be  trained  and 
evaluated.  With  these  criteria  in  mind,  we  arrived  at  a 
design that called for one and one-half days of training,
followed  by  one  full  day  of  performance  evaluation. 
This  schedule  allowed  us  to  train  and  evaluate  two 
groups  of  ratees  per  week. 

Development  of  Measurement  Instruments 

High-fidelity  performance  data  were  captured  by 
means  of  behavior-based  rating  scales  and  checklists, 
using  trainers  with  considerable  air  traffic  controller 
experience  or  current  controllers  as  raters.  Development 
and  implementation  of  these  instruments,  and  selection 
and  training  of  the  HFPM  raters  are  discussed  below. 


It  was  decided  that  controller  performance  should  be 
evaluated  across  broad  dimensions  of  performance,  as 
well  as  at  a  more  detailed  step-by-step  level.  Potential 
performance  dimensions  for  a  set  of  rating  scales  were 
identified  through  reviews  of  previous  literature  involv¬ 
ing  air  traffic  controllers,  existing  on-the-job-training 
forms,  performance  verification  forms,  and  current  AT- 
SAT  work  on  the  development  of  behavior  summary 
scales.  The  over-the-shoulder  (OTS)  nature  of  this 
evaluation  process,  coupled  with  the  maximal  perfor¬ 
mance  focus  of  the  high-fidelity  simulation  environ¬ 
ment,  required  the  development  of  rating  instruments 
designed  to  facilitate  efficient  observation  and  evalua¬ 
tion  of  performance. 

After  examining  several  possible  scale  formats,  we 
chose  a  7-point  effectiveness  scale  for  the  OTS  form, 
with  the  scale  points  clustered  into  three  primary  effec¬ 
tiveness  levels;  i.e.,  below  average  (1  or  2),  fully  adequate 
(3,  4,  or  5),  and  exceptional  (6  or  7).  Through  consul¬ 
tation  with  controllers  currently  working  as  Academy 
instructors,  we  tentatively  identified  eight  performance 
dimensions  and  developed  behavioral  descriptors  for 
these  dimensions  to  help  provide  a  frame-of-reference 
for the raters. The eight dimensions were: (1) Maintaining
Separation; (2) Maintaining Efficient Air Traffic
Flow;  (3)  Maintaining  Attention  and  Situation  Aware¬ 
ness;  (4)  Communicating  Clearly,  Accurately,  and  Con¬ 
cisely;  (5)  Facilitating  Information  Flow;  (6) 
Coordinating;  (7)  Performing  Multiple  Tasks;  and,  (8) 
Managing  Sector  Workload.  We  also  included  an  over¬ 
all  performance  category.  As  a  result  of  rater  feedback 
subsequent  to  pilot  testing  (described  later  in  this  chap¬ 
ter),  “Facilitating  Information  Flow”  was  dropped  from 
the  form.  This  was  due  primarily  to  perceived  overlap 
between  this  dimension  and  several  others,  including 
Dimensions 3, 4, 6, and 7. The OTS form can be found
in  Appendix  E. 

A  second  instrument  required  the  raters  to  focus  on 
more  detailed  behaviors  and  activities,  and  note  whether 
and  how  often  each  occurred.  A  “Behavioral  and  Events 
Checklist”  (BEC)  was  developed  for  use  with  each 
scenario.  The  BEC  required  raters  to  actively  observe 
the  ratees  controlling  traffic  during  each  scenario  and 
note  behaviors  such  as:  (1)  failure  to  accept  hand-offs, 
coordinate  pilot  requests,  etc.;  (2)  letters  of  agreement 
(LOA)/directive violations; (3) readback/hearback errors;
(4) unnecessary delays; (5) incorrect information
input into the computer; and (6) late frequency changes.
Raters  also  noted  operational  errors  and  deviations.  The 
BEC  form  can  be  found  in  Appendix  F. 

Rater Training

Fourteen  highly  experienced  controllers  from  field 
units  or  currently  working  as  instructors  at  the  FAA 
Academy  were  detailed  to  the  AT-SAT  project  to  serve 
as  raters  for  the  HFPM  portion  of  the  project.  Raters 
arrived  approximately  three  weeks  before  the  start  of 
data  collection  to  allow  time  for  adequate  training  and 
pilot  testing.  Thus,  our  rater  training  occurred  over  an 
extended  period  of  time,  affording  an  opportunity  for 
ensuring  high  levels  of  rater  calibration. 

During  their  first  week  at  the  Academy,  raters  were 
exposed  to  (1)  a  general  orientation  to  the  AT-SAT 
project,  its  purposes  and  objectives,  and  the  importance 
of  the  high-fidelity  component;  (2)  airspace  training; 
(3)  the  HFPM  instruments;  (4)  all  supporting  materials 
(such  as  Letters  of  Agreement,  etc.);  (5)  training  and 
evaluation  scenarios;  and  (6)  rating  processes  and  pro¬ 
cedures.  The  training  program  was  an  extremely  hands- 
on,  feedback  intensive  process.  During  this  first  week 
raters  served  as  both  raters  and  ratees,  controlling  traffic 
in  each  scenario  multiple  times,  as  well  as  serving  as 
raters  of  their  associates  who  took  turns  as  ratees.  This 
process  allowed  raters  to  become  extremely  familiar 
with  both  the  scenarios  and  evaluation  of  performance 
in  these  scenarios.  With  multiple  raters  evaluating  per¬ 
formance  in  each  scenario,  project  personnel  were  able 
to  provide  immediate  critique  and  feedback  to  raters, 
aimed  at  improving  accuracy  and  consistency  of  rater 
observation  and  evaluation. 

In  addition,  prior  to  rater  training,  we  “scripted” 
performances  on  several  scenarios,  such  that  deliberate 
errors  were  made  at  various  points  by  the  individual 
controlling  traffic.  Raters  were  exposed  to  these  “scripted” 
scenarios  early  in  the  training  so  as  to  more  easily 
facilitate  discussion  of  specific  types  of  controlling 
errors.  A  standardization  guide  was  developed  with  the 
cooperation  of  the  raters,  such  that  rules  for  how  ob¬ 
served  behaviors  were  to  be  evaluated  could  be  referred 
to  during  data  collection  if  any  questions  arose  (see 
Appendix  G).  All  of  these  activities  contributed  to 
enhanced  rater  calibration. 

Pilot  Tests  of  the  Performance  Measures 

The  plan  was  to  pilot  test  the  CBPM  and  the  perfor¬ 
mance  rating  program  at  two  Air  Route  Traffic  Control 
Centers  (ARTCCs),  Seattle  and  Salt  Lake  City.  The 
HFPM  was  to  be  pilot  tested  in  Oklahoma  City.  All 
materials  were  prepared  for  administration  of  the  CBPM 
and  ratings,  and  two  criterion  research  teams  proceeded 
to  the  pilot  test  sites.  In  general,  procedures  for  admin¬ 
istering  these  two  assessment  measures  proved  to  be 
effective.  Data  were  gathered  on  a  total  of  77  controllers 
at  the  two  locations.  Test  administrators  asked  pilot  test 
participants  for  their  reactions  to  the  CBPM,  and  many 
of  them  reported  that  the  situations  were  realistic  and 
like  those  that  occurred  on  their  jobs. 

Results  for  the  CBPM  are  presented  in  Table  4.4. 
The  distribution  of  total  scores  was  promising  in  the 
sense  that  there  was  variability  in  the  scores.  The  coef¬ 
ficient  alpha  was  moderate,  as  we  might  expect  from  a 
test that is likely multidimensional. Results for the
ratings  are  shown  in  Tables  4.5  and  4.6.  First,  we  were 
able  to  approach  our  target  of  two  supervisors  and  two 
peers for each ratee. A mean of 1.24 supervisors and 1.30
peers  per  ratee  participated  in  the  rating  program.  In 
addition,  both  the  supervisor  and  peer  ratings  had 
reasonable  degrees  of  variability.  Also,  the  interrater 
reliabilities  (intraclass  correlations)  were,  in  general, 
acceptable.  The  Coordinating  dimension  is  an  excep¬ 
tion.  When  interrater  reliabilities  were  computed  across 
the  supervisor  and  peer  sources,  they  ranged  from  .37  to 
.62  with  a  median  of  .54.  Thus,  reliability  improves 
when  both  sources’  data  are  used. 
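
As a rough illustration of an intraclass correlation used as an interrater reliability index, the following Python sketch computes a one-way random-effects ICC for a single rating. The report does not state which ICC form was used, and the actual data had unequal numbers of raters per ratee, so the balanced design, the function name `icc_oneway`, and any input data are assumptions made only for illustration.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for a single rater.

    ratings[i] lists the ratings given to ratee i; every ratee must have
    the same number of raters (a simplifying assumption of this sketch).
    """
    n = len(ratings)      # number of ratees
    k = len(ratings[0])   # raters per ratee
    if any(len(row) != k for row in ratings):
        raise ValueError("sketch assumes the same number of raters per ratee")
    grand = sum(sum(row) for row in ratings) / (n * k)
    means = [sum(row) / k for row in ratings]
    # Between-ratee and within-ratee mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, means) for x in row
    ) / (n * (k - 1))
    # Undefined (division by zero) if there is no variance at all.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

When two raters agree perfectly on every ratee the index is 1; when all disagreement lies within ratees and none between them, it becomes negative.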

In  reaction  to  the  pilot  test  experience,  we  modified 
the  script  for  the  rater  orientation  and  training  program. 
We  decided  to  retain  the  Coordinating  dimension  for 
the  main  study,  with  the  plan  that  if  reliability  contin¬ 
ued  to  be  low  we  might  not  use  the  data  for  that 
dimension.  With  the  CBPM,  one  item  was  dropped 
because  it  had  a  negative  item-total  score  correlation. 
That  is,  controllers  who  answered  this  item  correctly 
tended  to  have  low  total  CBPM  scores. 
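
A negative item-total correlation of this kind can be detected with a short computation. The Python sketch below correlates each item with the total of the remaining items (a "corrected" item-total correlation) and flags items whose correlation is negative. The report does not say whether its totals included the item being examined, so the correction, the function names, and the data layout are assumptions for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance: treat as zero for screening purposes
    return sxy / sqrt(sxx * syy)

def item_total_correlations(scores):
    """scores[i][j] is examinee i's score on item j.

    Returns one corrected item-total correlation per item.
    """
    n_items = len(scores[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]  # total without item j
        result.append(pearson(item, rest))
    return result

def flag_negative_items(scores):
    """Indices of items that correlate negatively with the rest of the test."""
    return [j for j, r in enumerate(item_total_correlations(scores)) if r < 0]
```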

The  primary  purpose  of  the  HFPM  pilot  test  was  to 
determine whether our rigorous schedule of one-and-one-half
days of training and one day of evaluation was
feasible  administratively.  Our  admittedly  ambitious 
design  required  completion  of  up  to  eight  practice 
scenarios  and  eight  graded  scenarios.  Start-up  and  shut¬ 
down  of  each  computer-generated  scenario  at  each  radar 
station,  setup  and  breakdown  of  associated  flight  strips, 
pre- and post-position relief briefings, and completion
of  OTS  ratings  and  checklists  all  had  to  be  accomplished 
within  the  allotted  time,  for  all  training  and  evaluation 
scenarios.  Thus,  smooth  coordination  and  timing  of 
activities  was  essential.  Prior  to  the  pilot  test,  prelimi¬ 
nary  “dry  runs”  had  already  convinced  us  to  eliminate 
one  of  the  eight  available  evaluation  scenarios,  due  to 
time  constraints. 

Six experienced controllers currently employed as
instructors  at  the  FAA  Academy  served  as  our  ratees  for 
the pilot test. They were administered the entire
two-and-one-half-day training/evaluation process, from
orientation through final evaluation scenarios. As a result
of  the  pilot  test,  and  in  an  effort  to  increase  the  efficiency 
of  the  process,  minor  revisions  were  made  to  general 
administrative  procedures.  However,  in  general,  proce¬ 
dures  for  administering  the  HFPM  proved  to  be  effec¬ 
tive;  all  anticipated  training  and  evaluation  requirements 
were  completed  on  time  and  without  major  problems. 

In  addition  to  this  logistical,  administration  focus  of 
the  pilot  test,  we  also  examined  the  consistency  of 
ratings  by  our  HFPM  raters.  Two  raters  were  assigned 
to  each  ratee,  and  the  collection  of  HFPM  data  by  two 
raters  for  each  ratee  across  each  of  the  seven  scenarios 
allowed  us  to  check  for  rater  or  scenario  peculiarities. 

Table 4.7 presents correlations between ratings for
rater pairs both across scenarios and within each scenario.
These correlations suggested that Scenarios 2 and 7 should be
examined more closely, as well as three OTS dimensions
(Communicating  Clearly,  Accurately,  and  Efficiently; 
Facilitating  Information  Flow;  and  Coordination).  To 
provide  additional  detail,  we  also  generated  a  table 
showing  magnitude  of  effectiveness  level  differences 
between  each  rater  pair  for  each  dimension  on  each 
scenario  (see  Appendix  H). 

Examination  of  these  data  and  discussion  with  our 
raters  helped  us  to  focus  on  behaviors  or  activities  in  the 
two  scenarios  that  led  to  ambiguous  ratings  and  to 
subsequently  clarify  these  situations.  Discussions  con¬ 
cerning  these  details  with  the  raters  also  allowed  us  to 
identify  specific  raters  in  need  of  more  training.  Finally, 
extensive  discussion  surrounding  the  reasons  for  lower 
than  expected  correlations  on  the  three  dimensions 
generated  the  conclusion  that  excessive  overlap  between 
the  three  dimensions  generated  confusion  as  to  where  to 
represent  the  observed  performance.  As  a  result,  the 
“Facilitating  Information  Flow”  dimension  was  dropped 
from  the  OTS  form. 

Training  the  Test  Site  Managers 

Our  staff  prepared  a  manual  describing  data  collec¬ 
tion  procedures  for  the  criterion  measures  during  the 
concurrent  validation  and  conducted  a  half-day  train¬ 
ing  session  on  how  to  collect  criterion  data  in  the  main 
sample.  We  reviewed  the  CBPM,  discussed  administra¬ 
tion  issues,  and  described  procedures  for  handling  prob¬ 
lems  (e.g.,  what  to  do  when  a  computer  malfunctions  in 
mid-scenario).  Test  site  managers  had  an  opportunity  to 
practice  setting  up  the  testing  stations  and  review  the 
beginning  portion  of  the  test.  They  were  also  briefed  on 
the  performance  rating  program.  We  described  proce¬ 
dures  for  obtaining  raters  and  training  them.  The  script 
for  training  raters  was  thoroughly  reviewed  and  ratio¬ 
nale  for  each  element  of  the  training  was  provided. 
Finally,  we  answered  all  of  the  test  site  managers’ 
questions.  These  test  site  managers  hired  and  trained 
data  collection  staff  at  their  individual  testing  locations. 
There  were  a  total  of  20  ARTCCs  that  participated  in  the 
concurrent  validation  study  (both  Phase  1  and  Phase  2). 

Data  Collection 

CBPM  data  were  collected  for  1046  controllers. 
Performance ratings for 1227 controllers were provided
by 535 supervisor and 1420 peer raters. Table 4.8 below
shows  the  number  of  supervisors  and  peers  rating  each 
controller.  CBPM  and  rating  data  were  available  for 
1043  controllers. 

HFPM data were collected for 107 controllers. This
sample was a subset of the main sample, so 107 controllers
had data for the CBPM, the ratings, and the HFPM.
In  particular,  controllers  from  the  main  sample  arrived 
in  Oklahoma  City  from  12  different  air  traffic  facilities 
throughout the U.S. to participate in the two-and-one-half-day
HFPM process. The one-and-one-half days of
training  consisted  of  four  primary  activities:  orienta¬ 
tion,  airspace  familiarization  and  review,  airspace  certi¬ 
fication  testing,  and  scenarios  practice.  To  accelerate 
learning  time,  a  hard  copy  and  computer  disk  describing 
the  airspace  had  been  developed  and  sent  to  controllers 
at  their  home  facility  for  “preread”  prior  to  arrival  in 
Oklahoma  City. 

Each  controller  was  then  introduced  to  the  Radar 
Training  Facility  (RTF)  and  subsequently  completed 
two  practice  scenarios.  After  completion  of  the  second 
scenario  and  follow-up  discussions  about  the  experi¬ 
ence,  the  controllers  were  required  to  take  an  airspace 
certification  test.  The  certification  consisted  of  70  recall 
and  recognition  items  designed  to  test  knowledge  of 
Aero  Center.  Those  individuals  not  receiving  a  passing 
grade  (at  least  70%  correct)  were  required  to  retest  on 
that  portion  of  the  test  they  did  not  pass.  The  107 
controllers  scored  an  average  of  94%  on  the  test,  with 
only  7  failures  (6.5%)  on  the  first  try.  All  controllers 
subsequently  passed  the  retest  and  were  certified  by  the 
trainers  to  advance  to  the  remaining  day  of  formal 
evaluation. 

After successful completion of the air traffic test, each
controller  received  training  on  six  additional  air  traffic 
scenarios.  During  this  time,  the  raters  acted  as  trainers 
and  facilitated  the  ratee’s  learning  of  the  airspace.  While 
questions  pertaining  to  knowledge  of  airspace  and  re¬ 
lated  regulations  were  answered  by  the  raters,  coaching 
ratees  on  how  to  more  effectively  and  efficiently  control 
traffic  was  prohibited. 

After  the  eight  training  scenarios  were  completed,  all 
ratees' performance was evaluated on each of seven
scenarios  that  together  required  approximately  8  hours 
to  complete.  The  seven  scenarios  consisted  of  four 
moderately  busy  and  three  very  busy  air  traffic  condi¬ 
tions,  increasing  in  complexity  from  Scenario  1  to 
Scenario 7. During this 8-hour period of evaluation,
raters  were  randomly  assigned  to  ratees  before  each 
scenario,  with  the  restriction  that  a  rater  should  not  be 
assigned to a ratee (1) from the rater's home facility; or
(2) if he/she was the rater's training scenario assignment.
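
The restricted random assignment described above can be sketched in Python as follows. The report does not describe the mechanics of the randomization, so the rejection-sampling approach (draw at random, redraw if a restriction is violated), the function name `assign_raters`, and the dictionary-based inputs are all assumptions. The sketch pairs one rater with each ratee for a single scenario.

```python
import random

def assign_raters(ratees, raters, home_facility, trained_by,
                  rng=None, max_tries=10_000):
    """Randomly pair each ratee with a distinct rater for one scenario.

    home_facility: dict mapping each person (rater or ratee) to a facility.
    trained_by: dict mapping each ratee to the rater who trained them.
    Returns a dict ratee -> rater; raises if no valid draw is found.
    """
    if rng is None:
        rng = random.Random()
    for _ in range(max_tries):
        # Distinct raters in random order, one per ratee.
        pool = rng.sample(raters, len(ratees))
        ok = all(
            home_facility[rater] != home_facility[ratee]  # restriction (1)
            and trained_by.get(ratee) != rater            # restriction (2)
            for ratee, rater in zip(ratees, pool)
        )
        if ok:
            return dict(zip(ratees, pool))
    raise RuntimeError("no assignment satisfies the restrictions")
```

Redrawing the whole assignment keeps every valid pairing equally likely, at the cost of extra draws when the restrictions are tight.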

While  ratees  were  controlling  traffic  in  a  particular 
scenario,  raters  continually  observed  and  noted  perfor¬ 
mance  using  the  BEC.  After  the  scenario  ended,  each 
rater completed the OTS ratings. In all, 11 training/
evaluation  sessions  were  conducted  within  a  7-week 
period.  During  four  of  these  sessions,  a  total  of  24  ratees 
were  evaluated  by  two  raters  at  a  time,  while  a  single  rater 
evaluated  ratee  performance  during  the  other  seven 
sessions. 

Results 

CBPM 

Table  4.9  shows  the  distribution  of  CBPM  scores.  As 
with  the  pilot  sample,  there  is  a  reasonable  amount  of 
variability.  Also,  item-total  score  correlations  range 
from  .01  to  .27  (mean  =  .11).  The  coefficient  alpha  was 
.63  for  this  84-item  test.  The  relatively  low  item-total 
correlations  and  the  modest  coefficient  alpha  suggest  that 
the  CBPM  is  measuring  more  than  a  single  construct. 
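
For readers who want to see how a coefficient alpha is obtained from item scores, the following Python sketch applies the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The function names and data layout are assumptions; the report does not describe the software used for its analyses.

```python
def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def coefficient_alpha(scores):
    """scores[i][j] is examinee i's score on item j."""
    k = len(scores[0])  # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

Alpha approaches 1 when items rise and fall together across examinees and approaches 0 when items are unrelated, which is why a heterogeneous test with low item-total correlations yields a modest value.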

Supervisor  and  Peer  Ratings 

In  Tables  4.10  and  4.11,  the  number  and  percent  of 
ratings  at  each  scale  point  are  depicted  for  supervisors 
and  peers  separately.  A  low  but  significant  percentage  of 
ratings are at the 1, 2, or 3 level for both supervisor and
peer  ratings.  Most  of  the  ratings  fall  at  the  4-7  level,  but 
overall,  the  variability  is  reasonable  for  both  sets  of  ratings. 

Table 4.12 contains the interrater reliabilities for the
supervisor  and  peer  ratings  separately  and  for  the  two 
sets  of  ratings  combined.  In  general,  the  reliabilities  are 
quite  high.  The  supervisor  reliabilities  are  higher  than 
the  peer  reliabilities,  but  the  differences  are  for  the  most 
part  very  small.  Importantly,  the  combined  supervisor/ 
peer  ratings  reliabilities  are  substantially  higher  than  the 
reliabilities  for  either  source  alone.  Conceptually,  it 
seems appropriate to get both rating sources' perspectives
on controller performance. Supervisors typically
have  more  experience  evaluating  performance  and  have 
seen  more  incumbents  perform  in  the  job;  peers  often 
work  side-by-side  with  the  controllers  they  are  rating, 
and  thus  have  good  first-hand  knowledge  of  their  per¬ 
formance.  The  result  of  higher  reliabilities  for  the  com¬ 
bined  ratings  makes  an  even  more  convincing  argument 
for  using  both  rating  sources. 
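
One general reason that pooled ratings tend to be more reliable is captured by the Spearman-Brown formula, which projects the reliability of the mean of several raters from the reliability of a single rater. The Python sketch below is background only: it assumes raters are interchangeable (parallel), which supervisors and peers may not be, and the report does not say that its combined reliabilities were computed this way. The function name is hypothetical.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the mean of n_raters parallel raters."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)
```

As an illustrative input not taken from the report, a single-rater reliability of .50 projects to about .67 for the mean of two such raters.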

Scores  for  each  ratee  were  created  by  computing  the 
mean  peer  and  mean  supervisor  rating  for  each  dimen¬ 
sion.  Scores  across  peer  and  supervisor  ratings  were  also 
computed  for  each  ratee  on  each  dimension  by  taking 
the mean of the peer and supervisor scores. Table 4.13
presents  the  means  and  standard  deviations  for  these 
rating  scores  on  each  dimension,  supervisors  and  peers 
separately,  and  the  two  sources  together.  The  means  are 
higher  for  the  peers  (range  =  5.03-5.46),  but  the  stan¬ 
dard  deviations  for  that  rating  source  are  generally 
almost  as  high  as  those  for  the  supervisor  raters. 
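
The scoring rule described above can be sketched in Python for a single ratee on a single dimension. The combined score is the mean of the two source means, not the mean of all individual ratings pooled together. How ratees with only one rating source were handled is not stated in the report; falling back to the available source is an assumption of this sketch, as is the function name.

```python
def mean(values):
    return sum(values) / len(values)

def dimension_scores(supervisor_ratings, peer_ratings):
    """Supervisor, peer, and combined scores for one ratee on one dimension."""
    sup = mean(supervisor_ratings) if supervisor_ratings else None
    peer = mean(peer_ratings) if peer_ratings else None
    # Assumption: if one source is missing, use whichever source is available.
    available = [s for s in (sup, peer) if s is not None]
    combined = mean(available) if available else None
    return {"supervisor": sup, "peer": peer, "combined": combined}
```

Averaging the source means gives supervisors and peers equal weight even when a ratee has more raters from one source than from the other.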

Table  4.14  presents  the  intercorrelations  between 
supervisor  and  peer  ratings  on  all  of  the  dimensions. 
First,  within  rating  source,  the  between-dimension  cor¬ 
relations  are  large.  This  is  common  with  rating  data. 
And  second,  the  supervisor-peer  correlations  for  the 
same  dimensions  (e.g.,  Communicating  =  .39)  are  at 
least  moderate  in  size,  again  showing  reasonable  agree¬ 
ment  across-source  regarding  the  relative  levels  of  effec¬ 
tiveness  for  the  different  controllers  rated. 

The  combined  supervisor/peer  ratings  were  factor 
analyzed  to  explore  the  dimensionality  of  the  ratings. 
This  analysis  addresses  the  question,  is  there  a  reason¬ 
able  way  of  summarizing  the  10  dimensions  with  a 
smaller  number  of  composite  categories?  The  3-factor 
solution,  shown  in  Table  4.15,  proved  to  be  the  most 
interpretable.  The  first  factor  was  called  Technical 
Performance, with Dimensions 1, 3, 6, 7, and 8 primarily
defining the factor. Technical Effort was the label
for  Factor  2,  with  Dimensions  2,  4,  5,  and  9  as  the 
defining  dimensions.  Finally,  Factor  3  was  defined  by  a 
single  dimension  and  was  called  Teamwork. 

Although  the  3-factor  solution  was  interpretable, 
keeping  the  three  criterion  variables  separate  for  the 
validation  analyses  seemed  problematic.  This  is  because 
(1)  the  variance  accounted  for  by  the  factors  is  very 
uneven  (82%  of  the  common  variance  is  accounted  for 
by the first factor); (2) the correlation between unit-weighted
composites representing the first two factors is
.78; correlations between each of these composites and
Teamwork  are  high  as  well  (.60  and  .63  respectively); 
and  (3)  all  but  one  of  the  10  dimensions  loads  on  a 
technical  performance  factor,  so  it  seemed  somewhat 
inappropriate  to  have  the  one-dimension  Teamwork 
variable  representing  1/3  of  the  rating  performance 
domain. 

Accordingly,  we  formed  a  single  rating  variable  rep¬ 
resented  by  a  unit-weighted  composite  of  ratings  on  the 
10 dimensions. The interrater reliability of this composite
is .71 for the combined supervisor and peer rating
data.  This  is  higher  than  the  reliabilities  for  individual 
dimensions.  This  would  be  expected,  but  it  is  another 
advantage  of  using  this  summary  rating  composite  to 
represent  the  rating  data. 

HFPM 

Table  4.16  contains  descriptive  statistics  for  the 
variables  included  in  both  of  the  rating  instruments 
used  during  the  HFPM  graded  scenarios.  For  the  OTS 
dimensions  and  the  BEC,  the  scores  represent  averages 
across  each  of  the  seven  graded  scenarios. 

The  means  of  the  individual  performance  dimen¬ 
sions  from  the  7-point  OTS  rating  scale  are  in  the  first 
section  of  Table  4.16  (Variables  1  through  7).  They 
range  from  a  low  of  3.66  for  Maintaining  Attention  and 
Situation Awareness to a high of 4.61 for Communicating
Clearly, Accurately and Efficiently. The scores from each
of  the  performance  dimensions  are  slightly  negatively 
skewed,  but  are  for  the  most  part,  normally  distributed. 

Variables 8 through 16 in Table 4.16 were collected
using  the  BEC.  To  reiterate,  these  scores  represent 
instances  where  the  controllers  had  either  made  a  mis¬ 
take  or  engaged  in  some  activity  that  caused  a  dangerous 
situation,  a  delay,  or  in  some  other  way  impeded  the 
flow  of  air  traffic  through  their  sector.  For  example,  a 
Letter of Agreement (LOA)/Directive Violation was judged
to  have  occurred  if  a  jet  was  not  established  at  250  knots 
prior  to  crossing  the  appropriate  arrival  fix  or  if  a 
frequency  change  was  issued  prior  to  completion  of  a 
handoff  for  the  appropriate  aircraft.  On  average,  each 
participant  had  2.42  LOA/Directive  Violations  in  each 
scenario. 

Table  4.17  contains  interrater  reliabilities  for  the 
OTS  Ratings  for  those  24  ratees  for  whom  multiple  rater 
information  was  available.  Overall,  the  interrater 
reliabilities  were  quite  high  for  the  OTS  ratings,  with 
median  interrater  reliabilities  ranging  from  a  low  of  .83 
for  Maintaining  Attention  and  Situation  Awareness  to  a 
high  of  .95  for  Maintaining  Separation.  In  addition, 
these  OTS  dimensions  were  found  to  be  highly 
intercorrelated  (median  r  =  .91).  Because  of  the  high 
levels  of  dimension  intercorrelation,  an  overall  compos¬ 
ite  will  be  used  in  future  analyses. 

All  relevant  variables  for  the  OTS  and  BEC  measures 
were  combined  and  subjected  to  an  overall  principal 
components  analysis  to  represent  a  final  high-fidelity 
performance criterion space. The resulting two-factor
solution  is  presented  in  Table  4.18.  The  first  compo¬ 
nent,  Overall  Technical  Proficiency,  consists  of  the  OTS 
rating  scales,  plus  the  operational  error,  operational 
deviation,  and  LOA/Directive  violation  variables  from 
the  BEC.  The  second  component  is  defined  by  six 
additional BEC variables and represents a sector management
component of controller performance. More specifically,
this factor represents Poor Sector Management,
whereby  the  controllers  more  consistently  make  late 
frequency  changes,  fail  to  accept  hand-offs,  commit 
readback/hearback  errors,  fail  to  accommodate  pilot 
requests,  delay  aircraft  unnecessarily,  and  enter  incor¬ 
rect  information  in  the  computer.  This  interpretation  is 
reinforced  by  the  strong  negative  correlation  (-.72) 
found  between  Overall  Technical  Proficiency  and  Poor 
Sector Management.

Correlations Between the Criterion Measures: Construct Validity Evidence

Table  4.19  depicts  the  relationships  between  scores 
on  the  84-item  CBPM,  the  two  HFPM  factors,  and  the 
combined  supervisor/peer  ratings.  First,  the  correlation 
between  the  CBPM  total  scores  and  the  HFPM  Factor 
1, arguably our purest measure of technical proficiency,
is  .54.  This  provides  strong  evidence  for  the  construct 
validity  of  the  CBPM.  Apparently,  this  lower  fidelity 
measure  of  technical  proficiency  is  tapping  much  the 
same  technical  skills  as  the  HFPM,  which  had  control¬ 
lers  working  in  an  environment  highly  similar  to  their 
actual  job  setting.  In  addition,  a  significant  negative 
correlation  exists  between  the  CBPM  and  the  second 
HFPM  factor,  Poor  Sector  Management. 

Considerable  evidence  for  the  construct  validity  of 
the ratings is also evident. The correlation between the
ratings and the first HFPM factor is .40. Thus, the
ratings,  containing  primarily  technical  proficiency-ori¬ 
ented  content,  correlate  substantially  with  our  highest 
fidelity  measure  of  technical  proficiency.  The  ratings 
also correlate significantly with the second HFPM
factor  (r  =  -.28),  suggesting  the  broad-based  coverage  of 
the  criterion  space  toward  which  the  ratings  were  tar¬ 
geted.  Finally,  the  ratings-CBPM  correlation  is  .22, 
suggesting  that  the  ratings  also  share  variance  associated 
with  the  judgment,  decision-making,  and  procedural 
knowledge  constructs  we  believe  the  CBPM  is  measur¬ 
ing.  This  suggests  that,  as  intended,  the  ratings  on  the 
first  two  categories  are  measuring  the  typical  perfor¬ 
mance  component  of  technical  proficiency. 

Overall,  there  is  impressive  evidence  that  the  CBPM 
and  the  ratings  are  measuring  the  criterion  domains  they 
were  targeted  to  measure.  At  this  point,  and  as  planned, 
we  examined  individual  CBPM  items  and  their  rela¬ 
tions  to  the  other  criteria,  with  the  intention  of  drop¬ 
ping  items  that  were  not  contributing  to  the  desired 
relationships.  For  this  step,  we  reviewed  the  item-total 
score  correlations,  and  CBPM  item  correlations  with 
HFPM  scores  and  the  rating  categories.  Items  with  very 
low  or  negative  correlations  with:  (1)  total  CBPM 
scores;  (2)  the  HFPM  scores,  especially  for  the  first 
factor;  and  (3)  the  rating  composite  were  considered  for 
exclusion  from  the  final  CBPM  scoring  system.  Also 
considered  were  the  links  to  important  tasks.  The  link¬ 
age  analysis  is  described  in  a  later  section.  Items  repre¬ 
senting  one  or  more  highly  important  tasks  were  given 
additional  consideration  for  inclusion  in  the  final  com¬ 
posite.  These  criteria  were  applied  concurrently  and  in 
a  compensatory  manner.  Thus,  for  example,  a  quite  low 
item-total  score  correlation  might  be  offset  by  a  high 
correlation  with  HFPM  scores. 

This  item  review  process  resulted  in  38  items  being 
retained  for  the  final  CBPM  scoring  system.  The  result¬ 
ing  CBPM  composite  has  a  coefficient  alpha  of  .61  and 
correlates  .61  and  -.42  with  the  two  HFPM  factors,  and 
.24  with  the  rating  composite.  Further,  coverage  of  the 
40  most  important  tasks  is  at  approximately  the  same 
level,  with  all  but  one  covered  by  at  least  one  CBPM 
item.  Thus,  the  final  composite  is  related  more  strongly 
to  the  first  HFPM  factor,  and  correlates  a  bit  more 
highly  with  the  technically-oriented  rating  composite. 
We  believe  this  final  CBPM  composite  has  even  better 
construct  validity  in  relation  to  the  other  criterion 
measures  than  did  the  total  test. 

Additional  Construct  Validity  Evidence 

Hedge  et  al.  (1993)  discuss  controller  performance 
measures  that  are  currently  collected  and  maintained  by 
the  FAA  and  the  issues  in  using  these  measures  as  criteria 
in  the  validation  of  controller  predictor  measures.  Some 
of  the  more  promising  archival  measures  are  those 
related  to  training  performance,  especially  the  time  to 
complete  various  phases  of  training  and  ratings  of 
performance  in  these  training  phases.  However,  there 
are  some  serious  problems  even  with  these  most  prom¬ 
ising  measures  (e.g.,  standardization  across  facilities, 
measures  are  not  available  for  all  controllers).  Thus,  our 
approach  in  the  present  effort  was  to  use  these  measures 
to  further  evaluate  the  construct  validity  of  the  AT-SAT 
criterion  measures. 

In  general,  training  performance  has  been  shown  to 
be  a  good  predictor  of  job  performance,  so  measures  of 
training  performance  should  correlate  with  the  AT- 
SAT  measures  of  job  performance.  Training  perfor¬ 
mance  data  were  available  for  809  of  the  1227  controllers 
in  the  concurrent  validation  sample.  Two  of  the  on-the- 
job  training  phases  (Phase  6  and  Phase  9)  are  reasonably 
standardized  across  facilities,  so  performance  measures 
from  these  two  phases  are  good  candidates  for  use  as 
performance  measures.  We  examined  the  correlation 
between  ratings  of  performance  across  these  two  phases 
and  the  correlations  between  five  variables  measuring 
training  time  (hours  and  days  to  complete  training  at 
each  phase).  The  rating  measures  did  not  even  correlate 
significantly  with  each  other,  and  thus  were  not  in¬ 
cluded  in  further  analyses.  Correlations  between  the 
training  time  variables  were  higher.  Because  the  time 
variables  appeared  to  be  tapping  similar  performance 
dimensions,  we  standardized  and  added  these  measures 
to create a “training time” scale. Controllers with fewer
than four out of the five variables measuring training
time  were  removed  from  further  analyses  (N=751). 
Correlations  between  training  time  and  ratings  of  per¬ 
formance  are  moderate  (r  =  .23).  The  correlation  with 
CBPM  scores  is  small  but  also  significant  (.08; p  <  .05). 
Thus,  the  correlations  with  training  time  support  the 
construct  validity  of  the  AT-SAT  field  criterion  mea¬ 
sures.  (Sample  sizes  for  the  HFPM  were  too  small  to 
conduct  these  analyses.) 
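
The construction of the "training time" scale can be sketched in Python as follows. Each of the five training-time variables is standardized across controllers, the standardized values are summed, and controllers with fewer than four available variables receive no score. The report does not say how a single missing variable was treated in the sum; here it contributes zero, which amounts to substituting the sample mean, and that choice, along with the function names, is an assumption.

```python
from math import sqrt

def zscores(values):
    """Standardize a column, leaving None (missing) entries as None.

    Needs at least two non-missing values with nonzero spread.
    """
    present = [v for v in values if v is not None]
    m = sum(present) / len(present)
    sd = sqrt(sum((v - m) ** 2 for v in present) / (len(present) - 1))
    return [None if v is None else (v - m) / sd for v in values]

def training_time_scale(rows, min_present=4):
    """rows[i] holds controller i's training-time variables (None = missing).

    Returns one scale score per controller, or None if too few variables.
    """
    columns = [zscores(col) for col in zip(*rows)]
    scores = []
    for z_row in zip(*columns):
        present = [z for z in z_row if z is not None]
        # Assumption: a missing variable adds zero to the sum.
        scores.append(sum(present) if len(present) >= min_present else None)
    return scores
```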

Linkage  Analysis 

A  panel  of  10  controller  SMEs  performed  a  judg¬ 
ment  task  with  the  CBPM  items.  These  controllers 
were  divided  into  three  groups,  and  each  group  was 
responsible  for  approximately  one  third  of  the  40 
critical  tasks  that  were  targeted  by  the  CBPM.  They 
reviewed  each  CBPM  scenario  and  the  items,  and 
analysis  were  involved  in  each  item.  These  ratings
were  then  discussed  by  the  entire  group  until  a 
consensus  was  reached.  Results  of  that  judgment  task 
appear  in  Table  4.20.  For  each  task,  the  table  shows 
the  number  of  CBPM  items  that  this  panel  agreed 
measured  that  task. 

Similarly,  10  controller  SMEs  performed  a  judg¬ 
ment  task  with  the  seven  HFPM  scenarios.  These 
controllers  were  divided  into  two  groups,  and  each 
group  was  responsible  for  half  of  the  scenarios.  Each 
scenario  was  viewed  in  three  10-minute  segments, 
and  group  members  noted  if  a  critical  subactivity  was 
performed.  After  the  three  10-minute  segments  for  a
given  scenario  were  completed,  the  group  discussed 
their  ratings  and  arrived  at  a  consensus  before  pro¬ 
ceeding  to  the  next  scenario.  Results  of  these  judg¬ 
ments  can  also  be  found  in  Table  4.20.  In  summary, 
38  of  the  40  critical  subactivities  were  covered  by  at 
least  a  subset  of  the  seven  scenarios.  On  average, 
almost  25  subactivities  appeared  in  each  scenario. 
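The coverage figures reported above (subactivities covered by at least one scenario, and the average number per scenario) are simple summaries of the consensus judgments. A minimal sketch of that tally follows, assuming the judgments are stored as one set of subactivity numbers per scenario. The data structure, the function name, and the toy values are hypothetical and are not the panel's actual ratings.

```python
def coverage_summary(scenario_subactivities, n_subactivities):
    """Summarize which subactivities the scenarios cover.

    scenario_subactivities maps a scenario label to the set of subactivity
    numbers (1..n_subactivities) that the panel agreed were performed.
    """
    covered = set().union(*scenario_subactivities.values())
    all_subactivities = set(range(1, n_subactivities + 1))
    per_scenario = [len(s) for s in scenario_subactivities.values()]
    return {
        "covered": len(covered),
        "uncovered": sorted(all_subactivities - covered),
        "mean_per_scenario": sum(per_scenario) / len(per_scenario),
    }


# Toy example with 3 scenarios and 6 subactivities (not the real judgments).
toy_consensus = {"S1": {1, 2, 3}, "S2": {2, 3, 4}, "S3": {1, 4}}
summary = coverage_summary(toy_consensus, n_subactivities=6)
```

Applied to the real consensus data, the same tally would give the number of covered subactivities out of 40 and the per-scenario average.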


Conclusions 

The  38-item  CBPM  composite  provides  a  very  good 
measure  of  the  technical  skills  necessary  to  separate 
aircraft  effectively  and  efficiently  on  the  “real  job.”  The 
.61  correlation  with  the  highly  realistic  HFPM  (Factor 
1)  is  especially  supportive  of  its  construct  validity  for 
measuring  performance  in  the  very  important  technical 
proficiency-related  part  of  the  job.  Additional  ties  to  the 
actual  controller  job  are  provided  by  the  links  of  CBPM 
items  to  the  most  important  controller  tasks  identified 
in  the  job  analysis. 

The  performance  ratings  provide  a  good  picture  of 
the  typical  performance  over  time  elements  of  the  job. 
Obtaining  both  a  supervisor  and  a  peer  perspective  on 
controller  performance  provides  a  relatively  comprehensive
view  of  day-to-day  performance.  High  interrater
agreement  across  the  two  rating  sources  further  strength¬ 
ens  the  argument  that  the  ratings  are  valid  evaluations  of 
controller  performance. 

Thus,  impressive  construct  validity  evidence  is  dem¬ 
onstrated  for  both  the  CBPM  and  the  rating  composite. 
Overall,  we  believe  the  38-item  CBPM  and  the  rating 
composite  represent  a  comprehensive  and  valid  set  of 
criterion  measures. 
CHAPTER  5.1


Field  Procedures  for  Concurrent  Validation  Study 

Lucy  B.  Wilson,  Christopher  J.  Zamberlan,  and  James  H.  Harris 

Caliber  Associates 


The  concurrent  validation  data  collection  was  carried 
out  in  12  locations  from  May  to  July,  1997.  Additional 
data  were  collected  in  4  locations  from  March  to  May, 
1998  to  increase  the  sample  size.  Data  collection  activi¬ 
ties  involved  two  days  of  computer-aided  test  adminis¬ 
tration  with  air  traffic  controllers  and  the  collection  of 
controller  performance  assessments  from  supervisory 
personnel  and  peers.  Each  site  was  managed  by  a  trained 
Test  Site  Manager  (TSM)  who  supervised  trained  on¬ 
site  data  collectors,  also  known  as  Test  Administrators 
(TAs).  A  subset  of  100  air  traffic  controllers  from  the 
May-July  sample  (who  completed  both  the  predictor 
and  criterion  battery  of  testing  and  for  whom  complete 
sets  of  performance  assessment  information  were  avail¬ 
able),  was  selected  to  complete  the  high  fidelity  criterion 
test  at  the  Academy  in  Oklahoma  City.  See  Chapter  4  for 
a  description  of  this  activity. 

Criterion  Measure  Pretest 

An  in-field  pretest  of  the  computerized  criterion 
measure  and  the  general  protocol  to  be  used  in  the 
concurrent  validation  test  was  conducted  in  April,  1997. 
The  en-route  air  traffic  control  centers  of  Salt  Lake  City, 
UT  and  Seattle,  WA  served  as  pretest  sites.  A  trained 
TSM  was  on  site  and  conducted  the  pretest  in  each 
location. 


Field  Site  Locations 

In  1997,  the  concurrent  validation  testing  was  conducted
in  12  en-route  air  traffic  control  centers  across  the
country.  The  test  center  sites  were: 


•  Atlanta,  GA 

•  Albuquerque,  NM 

•  Boston,  MA 

•  Denver,  CO 

•  Ft.  Worth,  TX 

•  Houston,  TX 


•  Jacksonville,  FL 

•  Kansas  City,  MO 

•  Los  Angeles,  CA 

•  Memphis,  TN 

•  Miami,  FL 

•  Minneapolis,  MN


The  additional  testing  in  1998  ran  in  Chicago,  Cleveland,
Washington,  DC,  and  Oklahoma  City.  The  en-route
centers  of  Chicago  and  Cleveland  performed  like
the  original  AT-SAT  sites,  testing  their  own  controllers. 
The  en-route  center  at  Leesburg,  Virginia,  which  serves 
the  Washington,  DC  area,  tested  their  controllers  as  well 
as  some  from  New  York.  At  the  Mike  Monroney  Aero¬ 
nautical  Center  in  Oklahoma  City,  the  Civil  Aeromedi- 
cal  Institute  (CAMI),  with  the  help  of  Omni  personnel, 
tested  controllers  from  Albuquerque,  Atlanta,  Houston, 
Miami,  and  Oakland.  All  traveling  controllers  were 
scheduled  by  Caliber  with  the  help  of  Arnold  Trevette  in
Leesburg  and  Shirley  Hoffpauir  in  Oklahoma  City. 

Field  Period 

Data  collection  activities  began  early  in  the  Ft.  Worth 
and  Denver  Centers  in  May,  1997.  The  remaining  nine 
centers  came  on  line  two  weeks  later.  To  ensure  adequate 
sample  size  and  diversity  of  participants,  one  additional 
field  site  —  Atlanta  —  was  included  beginning  in  June 
1997.  The  concurrent  data  collection  activities  continued
in  all  locations  until  mid-July.

Of  the  four  sites  in  1998,  Chicago  started  the  earliest
and  ran  the  longest,  for  a  little  over  two  months  begin¬ 
ning  in  early  March.  Washington,  DC  began  simulta¬ 
neously,  testing  and  rating  for  just  under  two  months. 
Cleveland  and  Oklahoma  City  began  a  couple  of  weeks 
into  March  and  ended  after  about  four  and  five  weeks, 
respectively. 

Selection  and  Training  of  Data  Collectors 

A  total  of  13  experienced  data  collection  personnel 
were  selected  to  serve  as  TSMs  during  the  first  data 
collection.  One  manager  was  assigned  to  each  of  the  test 
centers  and  one  TSM  remained  on  call  in  case  an 
emergency  replacement  was  needed  in  the  field. 

All  TSMs  underwent  an  intensive  3-day  training  in 
Fairfax,  VA  from  April  22  to  24,  1997.  The  training  was
led  by  the  team  of  designers  of  the  concurrent  validation 
tests.  The  objective  of  the  training  session  was  three-fold: 




•  To  acquaint  TSMs  with  the  FAA  and  the  en  route  air 
traffic  control  environment  in  which  the  testing  was  to 
be  conducted 

•  To  familiarize  TSMs  with  the  key  elements  of  the 
concurrent  validation  study  and  their  roles  in  it 

•  To  ground  TSMs  in  the  AT-SAT  test  administration 
protocol  and  field  procedures. 

A  copy  of  the  TSM  training  agenda  is  attached. 

Each  TSM  was  responsible  for  recruiting  and  training 
his  or  her  on-site  data  collectors  who  administered  the 
actual  test  battery.  The  TSM  training  agenda  was  adapted 
for  use  in  training  on-site  data  collectors.  In  addition  to 
didactic  instruction  and  role-playing,  the  initial  test 
administrations  of  all  on-site  data  collectors  were  ob¬ 
served  and  critiqued  by  the  TSMs. 

Three  TSMs  repeated  their  role  in  the  second  data 
collection.  Because  of  the  unique  role  of  the  fourth  site 
in  the  second  data  collection  (e.g.,  a  lack  of  previous 
experience  from  the  first  data  collection  and  three  times 
as  many  computers,  or  “testing  capability,”  as  any  other 
testing  site),  Caliber  conducted  a  special,  lengthier  train¬ 
ing  for  CAMI  personnel  in  Oklahoma  City  before  the 
second  data  collection  began. 

Site  Set  Up 

TSMs  traveled  to  their  sites  a  week  in  advance  of  the 
onset  of  data  collection  activities.  During  this  week  they 
met  with  the  en-route  center  personnel  and  the  “Partner 
Pairs”  assigned  to  work  with  them.  The  Partner  Pairs 
were  composed  of  a  member  of  ATC  management  and 
the  union  representative  responsible  for  coordinating 
the  center’s  resources  and  scheduling  the  air  traffic 
controllers  for  testing.  Their  assistance  was  invaluable  to 
the  success  of  the  data  collection  effort. 

TSMs  set  up  and  secured  their  testing  rooms  on  site 
during  this  initial  week  and  programmed  five  computers 
newly  acquired  for  use  in  the  concurrent  validation. 
They  trained  their  local  data  collectors  and  observed 
their  first  day’s  work. 

Air  Traffic  Controller  Testing 

Up  to  five  controllers  could  be,  and  frequently  were,
tested  on  an  8-hour  shift.  Testing  was  scheduled  at  the 
convenience  of  the  center,  with  most  of  the  testing 
occurring  during  the  day  and  evening  shifts,  although 
weekend  shifts  were  included  at  the  discretion  of  the  site. 
Controllers  were  scheduled  to  begin  testing  at  the  same
time.  While  Oklahoma  City  had  the  capacity  to  test  15
controllers  at  a  time,  it  did  not  use  its  expanded  capabil¬ 
ity  and  operated  like  every  other  five-computer  site,  for 
all  intents  and  purposes. 

At  the  beginning  of  the  first  day  of  the  2-day  testing 
effort,  the  data  collector  reviewed  the  Consent  Form 
with  each  participating  controller  and  had  it  signed  and 
witnessed.  (See  the  appendix  for  a  copy  of  the  Consent 
Form.)  Each  controller  was  assigned  a  unique  identifi¬ 
cation  number  through  which  all  parts  of  the  concurrent 
validation  tests  were  linked. 

The  predictor  battery  usually  was  administered  on  the 
first  day  of  controller  testing.  The  predictor  battery  was 
divided  into  four  blocks  with  breaks  permitted  between 
each  block  and  lunch  generally  taken  after  completion  of 
the  second  block. 

The  second  day  of  testing  could  occur  as  early  as  the
day  immediately  following  the  first  day  of  testing  or 
could  be  scheduled  up  to  several  weeks  later.  The 
second  day  of  concurrent  validation  testing  involved 
completion  of  the  computerized  criterion  test,  that  is, 
the  Computer  Based  Performance  Measure  (CBPM), 
and  the  Biographical  Information  Form.  (See  appen¬ 
dix  for  a  copy  of  the  Biographical  Information  Form.) 
At  the  end  of  the  second  day  of  testing,  participating 
controllers  were  asked  to  give  their  social  security 
numbers  so  that  archival  information  (e.g.,  scores  on 
Office  of  Personnel  Management  employment  tests) 
could  be  retrieved  and  linked  to  their  concurrent 
validation  test  results. 

Supervisory  Assessments 

Participating  controllers  nominated  two  supervisory 
personnel  and  two  peers  to  complete  assessments  of  them 
as  part  of  the  criterion  measurement.  While  the  selection 
of  the  peer  assessors  was  totally  at  the  discretion  of  the 
controller,  supervisory  and  administrative  staff  had  more 
leeway  in  selecting  the  supervisory  assessors  (although 
not  one’s  “supervisor  of  record”)  from  the  much  smaller 
pool  of  supervisors  in  order  to  complete  the  ratings. 
Throughout  the  data  collection  period,  supervisors  and 
peers  assembled  in  small  groups  and  were  given  stan¬ 
dardized  instructions  by  on-site  data  collectors  in  the 
completion  of  the  controller  assessments.  To  the  extent 
feasible,  supervisors  and  peers  completed  assessments  in 
a  single  session  on  all  the  controllers  who  designated 
them  as  their  assessor.  When  the  assessment  form  was 
completed,  controller  names  were  removed  and  replaced 
by  their  unique  identification  numbers.  The  assessment
forms  were  placed  in  sealed  envelopes  as  a  further  means 
of  protecting  confidentiality. 
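The replacement of controller names with identification numbers on completed assessment forms can be illustrated with a short sketch. The field names (`controller_name`, `controller_id`), the lookup table, and the sample values are hypothetical; the report describes the step itself but not how it was carried out.

```python
def deidentify(rating_forms, id_by_name):
    """Return copies of the rating forms with names replaced by study IDs.

    Raises KeyError if a name has no assigned ID, so that no form can pass
    through with its name silently left in place.
    """
    cleaned = []
    for form in rating_forms:
        record = {k: v for k, v in form.items() if k != "controller_name"}
        record["controller_id"] = id_by_name[form["controller_name"]]
        cleaned.append(record)
    return cleaned


# Hypothetical lookup table and completed forms.
id_by_name = {"Controller A": "10452", "Controller B": "10977"}
forms = [
    {"controller_name": "Controller A", "rater_type": "peer", "rating": 5},
    {"controller_name": "Controller B", "rater_type": "supervisor", "rating": 6},
]
anonymous_forms = deidentify(forms, id_by_name)
```

Because every part of the study is keyed to the same identification number, the de-identified ratings can still be linked to a controller's predictor and CBPM results.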

During  the  second  data  collection,  assessors  some¬ 
times  viewed  PDRI’s  “How  To”  video  in  lieu  of  verbal 
instruction.  This  was  especially  important  at  the  five 
non-testing  sites  that  had  no  TSMs  or  on-site  data 
collectors  (Albuquerque,  Atlanta,  Houston,  Miami,  and 
Oakland).  The  four  testing  sites  employed  the  video 
much  less  frequently,  if  at  all. 

Record  Keeping  and  Data  Transmission 

On-site  data  collectors  maintained  records  of  which 
controllers  had  participated  and  which  tests  had  been 
completed.  This  information  was  reported  on  a  daily 
basis  to  TSMs.  Several  times  a  week  on-site  data  collectors
transmitted  completed  test  information  (on  diskettes)
and  hard  copies  of  the  Biographical  Information
and  performance  assessment  forms  to  the  data  processing 
center  in  Alexandria,  VA. 

Site  Shut  Down 

At  the  end  of  the  data  collection  period,  each  site  was 
systematically  shut  down.  The  predictor  and  criterion 
test  programs  were  removed  from  the  computers,  as  were 
any  data  files.  Record  logs,  signed  consent  forms,  unused 
test  materials,  training  manuals  and  other  validation 
materials  were  returned  to  Caliber  Associates.  Chicago, 
the  last  site  of  the  second  data  collection  effort,  shut 
down  on  Monday,  May  11,  1998. 
CHAPTER  5.2


Development  Of  Pseudo-applicant  Sample 
Anthony  Bayless,  Caliber  Associates 


RATIONALE  FOR 
PSEUDO-APPLICANT  SAMPLE 

Before  becoming  Full  Performance  Level  (FPL)
controllers,  ATCSs  are  screened  on
their  entry-level  OPM  selection  test  scores,  performance
in  one  of  the  academy  screening  programs,  and
on-the-job  training  performance.  Because  of  these
multiple  screens  and  stringent  cutoffs,  only  the  better 
performing  ATCSs  are  retained  within  the  air  traffic 
workforce.  For  these  reasons,  the  concurrent  valida¬ 
tion  of  the  AT-SAT  battery  using  a  sample  of  ATCSs 
is  likely  to  result  in  an  underestimate  of  the  actual 
validity  because  of  restriction  in  range  in  the  predic¬ 
tors.  The  goal  of  this  part  of  the  project,  then,  was  to 
administer  the  AT-SAT  predictor  battery  to  a  sample 
that  more  closely  resembled  the  likely  applicant  pool 
than  would  a  sample  of  ATCS  job  incumbents. 

The  purpose  of  including  a  pseudo-applicant  (PA) 
sample  in  the  validation  study  was  to  obtain  variance 
estimates  from  an  unrestricted  sample  (i.e.,  not  explicitly 
screened  on  any  prior  selection  criteria).  Data  collected 
from  the  PA  study  were  used  to  statistically  “correct” 
predictor  scores  obtained  from  the  restricted,  concurrent 
validation  sample  of  ATCS  job  incumbents.  This  statis¬ 
tical  correction  was  necessary  because  the  validity  of 
predictors  is  based  on  the  strength  of  the  relationship 
between  the  predictors  and  job  performance  criteria.  If 
this  relationship  was  assessed  using  only  the  restricted 
sample  (i.e.,  FAA  job  incumbents  who  have  already  been 
screened  and  selected)  without  any  statistical  correction, 
the  strength  of  the  relationships  between  the  predictors 
and  job  performance  criteria  would  be  underestimated.1 
This  underestimation  of  the  validity  of  the  predictors 
might  lead  to  an  omission  of  an  important  predictor 
based  on  an  inaccurate  estimation  of  its  validity.  By 
using  the  PA  data  to  obtain  variance/covariance  estimates
from  an  unrestricted  sample  (i.e.,  a  pool  of  subjects
that  more  closely  represents  the  potential  range  of 
applicants),  the  underestimation  of  predictor  validity 
computed  from  the  restricted  sample  can  be  corrected. 
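To make the logic of the correction concrete, the sketch below applies the standard univariate correction for direct range restriction (often called Thorndike's Case II) to a single validity coefficient. It is offered only as an illustration of the principle. The text above refers to variance/covariance estimates, which suggests a multivariate procedure, and the corrections actually applied are described in Chapter 5.5. The function name and the numeric values are hypothetical.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Univariate correction for direct range restriction on the predictor.

    r_c = r * u / sqrt(1 - r**2 + r**2 * u**2), where u is the ratio of the
    unrestricted to the restricted predictor standard deviation.
    """
    u = sd_unrestricted / sd_restricted
    r = r_restricted
    return r * u / math.sqrt(1 - r**2 + (r**2) * (u**2))


# Hypothetical values: an observed validity of .30 among incumbents whose
# predictor SD is half the SD estimated from the pseudo-applicant sample.
corrected = correct_for_range_restriction(0.30, sd_unrestricted=2.0, sd_restricted=1.0)
unchanged = correct_for_range_restriction(0.30, sd_unrestricted=1.0, sd_restricted=1.0)
```

With these illustrative inputs the corrected coefficient is about .53, and when the two standard deviations are equal the coefficient is unchanged, as expected.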

ATCS  Applicant  Pool 

The  administration  of  the  AT-SAT  predictor  battery 
to  a  sample  closely  resembling  the  applicant  pool  re¬ 
quired  an  analysis  of  the  recent  ATCS  applicant  pool. 
Therefore,  the  project  team  requested  from  the  FAA  data 
about  recent  applicants  for  the  ATCS  job.  Because  of  a 
recent  hiring  freeze  on  ATCS  positions,  the  latest  background
data  available  for  ATCS  applicants  were  from
1990  through  part  of  1992.  Although  the  data  were
somewhat  dated  (i.e.,  1990-1992),  they  did  provide  some
indication  of  the  characteristics  that  should  be  emulated 
in  the  PA  sample.  Based  on  a  profile  analysis  provided  by 
the  FAA,  relevant  background  characteristics  of  36,024 
actual  applicants  for  FAA  ATCS  positions  were  made 
available.  Table  5.2.1  provides  a  breakout  of  some  per¬ 
tinent  variables  from  that  analysis. 

The  data  indicated  that  about  81%  of  applicants  were
male,  50%  had  some  college  education  but  no  degree, 
and  26%  had  a  bachelor’s  degree.  A  disconcerting  fact 
from  the  OPM  records  was  the  large  percentage  of 
missing  cases  (51.3%)  for  the  race/ethnicity  variable. 
Information  available  for  the  race/ethnicity  variable  rep¬ 
resented  data  from  17,560  out  of 36,024  cases.  Another 
issue  of  some  concern  was  the  age  of  the  data  provided. 
The  latest  data  were  at  least  four  years  old.  Although  it 
seems  unlikely  that  the  educational  profile  of  applicants 
would  have  changed  much  over  four  years,  it  was  more 
likely  that  the  gender  and  the  race/ethnicity  profiles  may 
have  changed  to  some  extent  over  the  same  period  of  time 
(i.e.,  more  female  and  ethnic  minority  applicants). 


1  This  underestimate  is  the  result  of  decreased  variation  in  the  predictor  scores  of  job  incumbents;  they  would  all  be 
expected  to  score  relatively  the  same  on  these  predictors.  When  there  is  very  little  variation  in  a  variable,  the  strength  of  its 
association  with  another  variable  will  be  weaker  than  when  there  is  considerable  variation.  In  the  case  of  these  predictors, 
the  underestimated  relationships  are  a  statistical  artifact  resulting  from  the  sample  selection. 


Because  of  the  concern  about  the  age  of  the  applicant 
pool  data  and  the  amount  of  missing  data  for  the  race/ 
ethnicity  variable,  a  profile  of  national  background  char¬ 
acteristics  was  obtained  from  the  U.S.  Bureau  of  the 
Census.  As  shown  in  Table  5.2.2,  1990  data  from  the 
U.S.  Bureau  of  the  Census  indicated  the  following 
national  breakout  for  race/ethnicity: 

Without  more  up-to-date  and  accurate  data  about  the 
applicant  pool,  the  national  data  were  used  to  inform 
sampling  decisions.  Basing  preliminary  sampling  plans  on  the
race/ethnicity  percentages  above,  we  recommended  a  total
sample  of  at  least  300  PAs,  assuming  the  sample  followed
the  same  distributional  characteristics  as  the  national
race/ethnicity  data.
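A sampling plan of this kind amounts to allocating a fixed total across groups in proportion to their population shares. The sketch below shows one way to do that with largest-remainder rounding, so that the group targets always sum to the total. The group labels and proportions are placeholders chosen for illustration; they are not the Census figures from Table 5.2.2, and the rounding rule is our assumption rather than the project's documented procedure.

```python
from fractions import Fraction
from math import floor


def allocate_sample(total, proportions):
    """Split total across groups in proportion to their population shares.

    Uses largest-remainder rounding so the targets sum exactly to total.
    Proportions are converted through str() to avoid binary float drift.
    """
    exact = {g: Fraction(str(p)) * total for g, p in proportions.items()}
    counts = {g: floor(x) for g, x in exact.items()}
    leftover = total - sum(counts.values())
    # Give the remaining units to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts


# Placeholder shares (NOT the Census values); they sum to 1.
placeholder_shares = {"group_a": 0.72, "group_b": 0.13, "group_c": 0.11, "group_d": 0.04}
targets_300 = allocate_sample(300, placeholder_shares)
targets_25 = allocate_sample(25, placeholder_shares)
```

With a total of 300 the placeholder shares divide evenly; the smaller total of 25 shows the rounding rule at work.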

Pseudo-Applicant  Sample  Composition  and 
Characteristics 

Again,  the  impetus  for  generating  a  PA  sample  was  to 
administer  the  AT-SAT  predictor  battery  to  a  sample 
that  more  closely  resembled  the  likely  applicant  pool 
than  would  a  sample  of  ATCS  job  incumbents.  The 
project  team  decided  to  collect  data  from  two  different 
pools  of  PAs:  one  civilian  and  the  other  military.  The 
civilian  PA  sample  was  generated  using  public  advertise¬ 
ment  and  comprised  the  volunteers  obtained  from  such 
advertisement.  Because  the  sample  size  of  the  civilian  PA 
sample  was  dependent  on  an  unpredictable  number  of 
volunteers,  a  decision  was  made  to  also  collect  data  from 
a  military  PA  sample.  The  military  PA  sample  afforded 
a  known  and  large  sample  size  and  access  to  scores  on 
their  Armed  Services  Vocational  Aptitude  Battery 
(ASVAB)  with  their  granted  permission.  Each  of  these 
two  pools  of  PAs  is  described  in  the  following  two
subsections. 

Civilian  Pseudo-Applicant  Sample 

Because  the  computer  equipment  with  the  predic¬ 
tor  and  criterion  software  was  already  set  up  at  each  of 
the  12  CV  testing  sites,  public  advertisements  were 
placed  locally  around  the  CV  testing  sites  to  generate 
volunteers  for  the  civilian  PA  sample.  The  goal  for 
each  testing  site  was  to  test  40  PAs  to  help  ensure  an 
adequate  civilian  PA  sample  size. 

Public  advertisement  for  the  civilian  PA  sample  was 
accomplished  via  several  different  methods.  One  method 
was  to  place  classified  advertisements  in  the  largest  local, 
metropolitan  newspapers  (and  some  smaller  newspapers 
for  those  CV  sites  located  away  from  major  metropolitan
areas).  An  example  classified  newspaper  advertisement  is
shown  in  Figure  5.2.1.  Another  means  of  advertising  the 
testing  opportunity  was  to  place  flyers  at  locations  in 
proximity  to  the  testing  site.  For  example,  flyers  were 
placed  at  local  vocational  technical  schools  and  colleges/ 
universities.  An  example  flyer  advertisement  is  shown  in 
Figure  5.2.2.  A  third  means  of  advertising  the  testing  to 
civilian  PAs  was  to  publicize  the  effort  via  ATCS  to  their 
family,  friends,  and  acquaintances. 

When  responding  to  any  form  of  advertisement, 
potential  civilian  PAs  were  requested  to  call  a  toll-free 
number  where  a  central  scheduler/coordinator  would 
screen  the  caller  on  minimum  qualifications  (i.e.,  US 
citizenship,  ages  between  17  and  30,  AND  at  least  3 
years  of  general  work  experience)  and  provide  the 
individual  with  background  about  the  project  and  the 
possible  testing  dates  and  arrival  time(s).  After  a  PA 
had  been  scheduled  for  testing,  the  scheduler/coordi¬ 
nator  would  contact  the  testing  site  manager  for  the 
relevant  testing  location  and  notify  him/her  so  that 
the  testing  time  slot  could  be  reserved  for  a  PA  instead 
of  an  ATCS  (for  those  sites  testing  PAs  and  ATCSs 
concurrently).  The  scheduler/coordinator  would  also 
mail  a  form  letter  to  the  newly  scheduled  PA  indicat¬ 
ing  the  agreed  upon  testing  time  and  date,  directions 
to  the  testing  facility,  and  things  to  bring  with  them 
(i.e.,  driver’s  license  and  birth  certificate  or  passport) 
for  verification  of  age  and  citizenship. 
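The telephone screen described above reduces to three checks. A minimal sketch follows, assuming that the age range of 17 to 30 is inclusive at both ends (the text does not say) and using hypothetical field and function names.

```python
from dataclasses import dataclass


@dataclass
class Caller:
    us_citizen: bool
    age: int
    years_work_experience: float


def meets_minimum_qualifications(caller):
    """Apply the three minimum qualifications stated in the text.

    ASSUMPTION: "between 17 and 30" is treated as inclusive.
    """
    return (
        caller.us_citizen
        and 17 <= caller.age <= 30
        and caller.years_work_experience >= 3
    )


# Hypothetical callers.
eligible = meets_minimum_qualifications(Caller(True, 24, 4))
too_old = meets_minimum_qualifications(Caller(True, 31, 10))
too_little_experience = meets_minimum_qualifications(Caller(True, 22, 2.5))
```

All three conditions must hold, which matches the capitalized "AND" in the screening criteria.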

Military  Pseudo-Applicant  Sample 

Because  of  the  uncertainty  about  being  able  to 
generate  a  sufficient  PA  sample  from  the  civilian 
volunteers,  it  was  decided  to  collect  additional  data 
from  a  military  PA  sample.  Again,  the  military  PA 
sample  would  afford  a  known  sample  size  and  access 
to  their  ASVAB  scores  which  would  prove  useful  for 
validation  purposes.  For  these  reasons,  the  FAA  nego¬ 
tiated  with  the  U.S.  Air  Force  to  test  participants  at 
Keesler  A.F.B.,  Biloxi,  Mississippi.  The  military  PAs 
were  students  and  instructors  stationed  at  Keesler 
A.F.B.  Predictor  data  were  collected  from  approxi¬ 
mately  262  military  PAs  of  which  132  (50.4%)  were 
currently  enrolled  in  the  Air  Traffic  Control  School; 
106  (40.5%)  were  students  in  other  fields  such  as 
Weather  Apprentice,  Ground  Radar  Maintenance, 
and  Operations  Resource  Management;  and  24  (9.2%) 
were  Air  Traffic  Control  School  instructors.  Table 
5.2.3  provides  a  breakout  of  gender  and  race/ethnicity 
by  type  of  sample. 
The  data  in  Table  5.2.1  indicate  that  the  civilian  and
military  PA  samples  were  very  similar  with  respect  to 
their  gender  and  race/ethnicity  profiles.  In  addition, 
both  of  the  PA  samples  were  more  diverse  than  the 
ATCS  sample  and  fairly  similar  to  the  1990  U.S. 
Bureau  of  Census  national  breakdown  (compare  data 
of  Table  5.2.1  to  data  of  Table  5.2.2). 

On-Site  Data  Collection 

Pseudo-applicants  were  administered  the  predictor 
battery  using  the  same  testing  procedures  as  followed  for 
the  ATCS  CV  sample.  The  only  differences  between  the 
civilian  and  military  PA  sample  data  collection  proce¬ 
dures  were  that: 

1.  civilians  were  tested  with  no  more  than  four  other
testing  participants  at  a  time  (due  to  the  limited 
number  of  computers  available  at  any  one  of  the 
testing  sites),  whereas  military  PAs  at  Keesler  A.F.B. 
were  tested  in  large  groups  of  up  to  50  participants 
per  session. 

2.  the  replacement  caps  for  select  keyboard  keys 
were  not  compatible  with  the  rental  computer  key¬ 
boards  and  were  unusable.  Because  of  this  problem, 
index  cards  were  placed  adjacent  to  each  of  the  com¬ 
puter  test  stations  informing  the  test  taker  of  the 
proper  keys  to  use  for  particular  predictor  tests.  The  use 
of  the  index  cards  instead  of  the  replacement  keys  did 
not  appear  to  cause  any  confusion  for  the  test  takers. 


Test  site  administrators  provided  the  PAs  with  a 
standardized  introduction  and  set  of  instructions  about 
the  testing  procedures  to  be  followed  during  the  com¬ 
puter-administered  battery.  During  the  introduction  the 
administrators  informed  the  PAs  of  the  purpose  of  the 
study  and  any  risks  and  benefits  associated  with  participation
in  the  study.  The  confidentiality  of  each  participant’s
results  was  emphasized.  In  addition,  participants
were  asked  to  sign  a  consent  agreement  attesting  to  their 
voluntary  participation  in  the  study,  their  understanding 
of  the  purpose  of  the  study,  the  risks/benefits  of  partici¬ 
pation,  and  the  confidentiality  of  their  results.  For  the 
military  PAs,  those  who  signed  a  Privacy  Act  Statement 
gave  their  permission  to  link  their  predictor  test  results 
with  their  ASVAB  scores. 

The  testing  volunteers  were  required  to  sacrifice  one 
eight-hour  day  to  complete  the  predictor  battery.  Al¬ 
though  testing  volunteers  were  not  compensated  for 
their  time  due  to  project  budget  constraints,  they  were 
provided  with  compensation  for  their  lunch. 

Correction  for  Range  Restriction 

As  mentioned  previously,  the  reason  for  collecting 
predictor  data  from  PAs  was  to  obtain  variance  estimates 
from  individuals  more  similar  to  actual  applicants  for  use 
in  correcting  validity  coefficients  for  tests  derived  from 
a  restricted  sample  (i.e.,  job  incumbents).  A  description 
of  the  results  of  the  range  restriction  corrections  is 
contained  in  Chapter  5.5. 
CHAPTER  5.3


Development  of  Database 

Ani  S.  DiFazio 
HumRRO 


The  soundness  of  the  validity  and  fairness  analyses 
conducted  on  the  beta  test  data,  and  of  the  recommen¬ 
dations  based  on  those  results,  was  predicated  on  reliable 
and  complete  data.  Therefore,  database  design,  imple¬ 
mentation,  and  management  were  of  critical  importance 
in  validating  the  predictor  tests  and  selecting  tests  for 
inclusion  in  Version  1  of  the  Test  Battery.  The  Valida¬ 
tion  Analysis  Plan  required  many  diverse  types  of  data 
from  a  number  of  different  sources.  This  section  de¬ 
scribes  the  procedures  used  in  processing  these  data  and 
integrating  them  into  a  cohesive  and  reliable  analysis 
database. 

Data  Collection  Instruments 

As  described  in  section  5.1,  data  from  computerized 
predictor  and  criterion  tests  were  automatically  written 
as  ASCII  files  by  the  test  software  at  the  test  sites. 
Depending  on  the  test,  the  data  were  written  either  as  the 
examinee  was  taking  the  test  or  upon  completion  of  the 
test.  The  data  file  structure  written  by  each  test  program 
was  unique  to  that  test.  Each  file  represented  an  indi¬ 
vidual  test  taken  by  a  single  examinee.  A  complete 
battery  of  tests  consisted  of  13  computerized  predictor 
tests  as  well  as  one  computerized  criterion  test.  For  the 
first  AT-SAT  data  collection  (AT-SAT  1),  high-fidelity 
criterion  measures  were  also  obtained  on  a  subset  of  the 
controller  participants. 
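Because each file holds one test for one examinee, and a complete battery is 13 predictor tests plus one criterion test, a file inventory can be checked for incomplete batteries. The sketch below assumes the inventory has already been reduced to (examinee ID, test code) pairs. The test codes `P01` to `P13` and `CBPM`, the examinee IDs, and the function name are hypothetical; the actual file naming is not described here.

```python
from collections import defaultdict

# Hypothetical test codes: 13 predictor tests and 1 criterion test.
PREDICTOR_TESTS = [f"P{n:02d}" for n in range(1, 14)]
CRITERION_TEST = "CBPM"
EXPECTED = set(PREDICTOR_TESTS) | {CRITERION_TEST}


def missing_tests(file_index):
    """Map each examinee with an incomplete battery to the tests still missing.

    file_index is an iterable of (examinee_id, test_code) pairs, one pair per
    data file received.
    """
    seen = defaultdict(set)
    for examinee_id, test_code in file_index:
        seen[examinee_id].add(test_code)
    return {
        eid: sorted(EXPECTED - tests)
        for eid, tests in seen.items()
        if EXPECTED - tests
    }


# Examinee 2001 has all 14 files; examinee 2002 lacks P13 and the criterion test.
index = [("2001", t) for t in sorted(EXPECTED)]
index += [("2002", t) for t in PREDICTOR_TESTS[:12]]
incomplete = missing_tests(index)
```

For pseudo-applicants, who took only the predictor battery, a missing criterion file would be expected rather than a problem, so a real check would need to take participant type into account.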

In  addition  to  the  automated  test  data,  several  differ¬ 
ent  types  of  data  were  collected  by  hard  copy  data 
collection  instruments.  These  include  three  biographical 
information  forms  for  controller  participants,  pseudo¬ 
applicant  participants,  and  assessors,  a  Request  of  SSN 
for  Retrieval  of  the  Historical  Archival  Data  form,  and  a 
Criterion  Assessment  Rating  Assessment  Sheet.  The 
Validation  Analysis  Plan  also  called  for  the  integration  of 
historical  archival  data  from  the  FAA. 


Initial  Data  Processing 

Automated  Test  Files 

Data  Transmittals.  The  automated  test  data  collected
at  the  17  test  sites  were  initially  sent  to  HumRRO
via  Federal  Express  on  a  daily  basis.  This  was  done  so  that 
analysts  could  monitor  test  sites  closely  in  the  beginning 
of  the  test  period  and  solve  problems  immediately  as  they 
arose.  Once  confident  that  a  test  site  was  following  the 
procedures  outlined  in  the  AT-SAT  Concurrent  Valida¬ 
tion  Test  Administration  Manual  and  was  not  having 
difficulty  in  collecting  and  transmitting  data,  it  was  put 
on  a  weekly  data  transmittal  schedule.  Out  of  approxi¬ 
mately  seven  and  a  half  weeks  of  testing,  the  typical  site 
followed  a  daily  transmittal  schedule  for  the  first  two 
weeks  and  then  sent  data  on  a  weekly  schedule  for  the 
remainder  of  the  testing  period.  In  total,  HumRRO 
received  and  processed  297  Federal  Express  packages 
containing  data  transmittals  from  the  17  test  sites. 

The  sites  were  provided  detailed  instructions  on  the 
materials  to  be  included  in  a  data  transmittal  packet. 
First,  packets  contained  a  diskette  of  automated  test  files 
for  each  day  of  testing.2  Sites  were  asked  to  include  a 
Daily  Activity  Log  (DAL)  if  any  problems  or  situations 
arose  that  might  affect  examinee  test  performance.  Along 
with  each  diskette,  the  sites  were  required  to  submit  a 
Data  Transmittal  Form  (DTF)3  which  provided  an 
inventory  of  the  pieces  of  data  contained  in  the  transmit¬ 
tal  packet.  During  the  testing  period,  HumRRO  re¬ 
ceived  and  processed  622  hard  copy  DTFs. 

Data  Processing  Strategy.  Because  of  the  magnitude 
of  data  and  the  very  limited  time  allocated  for  its  process¬ 
ing,  a  detailed  data  processing  plan  was  essential.  The 
three  main  objectives  in  developing  a  strategy  for  processing 
the automated test data from the test sites were to:


2  Some  sites  wrote  the  transmittal  diskette  at  the  end  of  the  test  day,  while  others  cut  the  data  at  the  end  of  a  shift.  In  these 
cases,  more  than  one  diskette  would  be  produced  for  each  test  day. 

3  While  a  DTF  was  supposed  to  be  produced  for  each  diskette  transmitted,  some  sites  sent  one  DTF  covering  a  number  of 
test days, and, conversely, more than one DTF describing a single diskette.

• Ensure that the test sites were transmitting all the data
they  were  collecting  and  that  no  data  were  inadvertently 
falling  through  the  cracks  in  the  field. 

•  Closely  monitor  the  writing  and  transmittal  of  data  by 
the  sites,  so  that  problems  would  be  quickly  addressed 
before  large  amounts  of  data  were  affected. 

•  Identify  and  resolve  problematic  or  anomalous  files. 

To  accomplish  these  objectives,  the  test  data  were 
initially  passed  through  two  stages  of  data  processing  as 
testing  was  in  progress.  A  third  processing  stage,  de¬ 
scribed  in  the  later  subsection  “Integration  of  AT-SAT 
Data,”  occurred  after  testing  was  completed  and  served 
to  integrate  the  diverse  data  collected  for  this  effort  into 
a  reliable  and  cohesive  database. 

During  the  testing  period,  up  to  four  work  stations 
were  dedicated  to  processing  data  transmittal  packets 
sent  by  the  sites.  One  work  station  was  reserved  almost 
exclusively  for  preliminary  processing  of  the  packets. 
This  “stage  one”  processing  involved  unpacking  Federal 
Express  transmittals,  identifying  obvious  problems,  date 
stamping  and  transcribing  the  DTF  number  on  all  hard 
copy  data  collection  forms,  summarizing  AT-SAT  1 
examinee  demographic  information  for  weekly  reports, 
and  ensuring  that  the  data  were  passed  on  to  the  next 
stage  of  data  processing. 

The  “stage  two”  data  processors  were  responsible  for 
the  initial  computer  processing  of  the  test  data.  Their 
work  began  by  running  a  Master  Login  procedure  that 
copied  the  contents  of  each  diskette  transmitted  by  the 
test  sites  onto  the  work  station’s  hard  drive.  This  proce¬ 
dure  produced  a  hard  copy  list  of  the  contents  of  the 
diskette  and  provided  a  baseline  record  of  all  the  data 
received from the sites.4 Next, using a key entry screen
developed solely for this application, information on
participant data from each DTF was automated and Statistical
Analysis System (SAS) DTF files were created.5

This  “stage  two”  automation  of  DTF  hard  copy  forms 
served  both  record  keeping  and  quality  assurance  func¬ 
tions.  To  gauge  whether  the  sites  were  transmitting  all 
the  data  they  collected,  the  inventory  of  participant 
predictor  and  CBPM  test  data  listed  on  the  DTF  was 
compared  electronically  to  the  files  contained  on  the 
diskette  being  processed.6  Whenever  there  was  a  discrep¬ 
ancy,  the  data  processing  software  developed  for  this 
application  automatically  printed  a  report  listing  the 
names of the discrepant files. Discrepancies were reported
both when fewer files and when more files were recorded on
the diskette than the DTF indicated. Test site
managers/administrators  were  then  contacted  by  the 
data  processors  to  resolve  the  discrepancies.  This  proce¬ 
dure  identified  files  that  test  sites  inadvertently  omitted 
in  the  data  transmittal  package.7 
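
The DTF-to-diskette comparison described above can be sketched as a two-way set difference. The Python below is only an illustration and is not the project's own software. The block-to-test mapping, the examinee numbers, and the function names are all assumptions made for the example. It first expands examinee-level DTF records into examinee/test pairs, as footnote 6 describes, and then reports discrepancies in both directions.

```python
# Hypothetical mapping of DTF testing blocks to test codes (not from the report).
BLOCK_TESTS = {
    "block1": ["AM", "DI"],
    "block2": ["ME", "MR"],
}

def expand_dtf(dtf_records):
    """Turn examinee-level DTF records into a set of (examinee, test) pairs."""
    expected = set()
    for examinee, blocks in dtf_records.items():
        for block in blocks:
            for test in BLOCK_TESTS[block]:
                expected.add((examinee, test))
    return expected

def compare(dtf_records, diskette_files):
    """Return (missing, unexpected).

    missing:    pairs the DTF lists but the diskette lacks
    unexpected: pairs on the diskette that the DTF does not list
    """
    expected = expand_dtf(dtf_records)
    received = set(diskette_files)
    return sorted(expected - received), sorted(received - expected)

# Made-up example data: examinee 1002 is missing one file, and
# examinee 1003 appears on the diskette but not on the DTF.
dtf = {"1001": ["block1"], "1002": ["block1", "block2"]}
on_disk = [("1001", "AM"), ("1001", "DI"), ("1002", "AM"),
           ("1002", "DI"), ("1002", "ME"), ("1003", "AM")]
missing, unexpected = compare(dtf, on_disk)
```

Both lists would be printed for follow-up with the test site. As the next paragraph notes, a check of this kind cannot detect a file that is absent from both the DTF and the diskette.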

As helpful as this procedure was in catching data that
may have been overlooked at sites, it could identify a
missing file only if the DTF listed that file as present.
The procedure would not catch files that
were never listed on the DTF. It was clear that this sort
of  error  of  omission  was  more  likely  to  occur  when  large 
amounts  of  data  were  being  collected  at  sites.  While  the 
second  AT-SAT  data  collection  (AT-SAT  2)  tested  just 
over  300  participants,  AT-SAT  1  included  over  four  and 
a  half  times  that  number.  Therefore,  if  this  type  of  error 
of  omission  was  going  to  occur,  it  would  likely  occur 
during  the  first  AT-SAT  data  collection  rather  than  the 
second.  To  avoid  this  error,  the  AT-SAT  1  test  site 
managers  needed  to  assess  the  completeness  of  the  data 
sent  for  processing  against  other  records  maintained  at 


4  The  Master  Login  software  did  not  copy  certain  files,  such  as  those  with  zero  bytes. 

5  In  automating  the  DTF,  we  wanted  one  DTF  record  for  each  diskette  transmitted.  Because  sites  sometimes  included  the 
information  from  more  than  one  diskette  on  a  hard  copy  DTF,  more  than  one  automated  record  was  created  for  those 
DTFs.  Conversely,  if  more  than  one  hard  copy  DTF  was  transmitted  for  a  single  diskette,  they  were  combined  to  form  one 
automated  DTF  record. 

6  This  computerized  comparison  was  made  between  the  automated  DTF  and  an  ASCII  capture  of  the  DOS  directory  of  the 
diskette  from  the  test  site.  The  units  of  analysis  in  these  two  datasets  were  originally  different.  Since  a  record  in  the 
directory  capture  data  was  a  file  (i.e.,  an  examinee/test  combination),  there  was  more  than  one  record  per  examinee.  An 
observation  in  the  original  DTF  file  was  an  examinee,  with  variables  indicating  the  presence  (or  absence)  of  specific  tests.  In 
addition,  the  DTF  inventoried  predictor  tests  in  four  testing  blocks  rather  than  as  individual  tests.  Examinee/test-level  data 
were  generated  from  the  DTF  by  producing  dummy  electronic  DTF  records  for  each  predictor  test  that  was  included  in  a 
test  block  that  the  examinee  took.  Dummy  CBPM  DTF  records  were  also  generated  in  this  manner.  By  this  procedure,  the 
unit  of  analysis  in  the  automated  DTF  and  DOS  directory  datasets  was  made  identical  and  a  one-to-one  computerized 
comparison  could  be  made  between  the  DTF  and  the  data  actually  received. 

7  Conversely,  this  procedure  was  also  used  to  identify  and  resolve  with  the  sites  those  files  that  appeared  on  the  diskette,  but 
not on the DTF.

the site, such as the Individual Control Forms. Approximately
three quarters into the AT-SAT 1 testing period,
the  data  processors  developed  a  table  for  each  site  that 
listed  examinees  by  the  types  of  data8  that  had  been 
received  for  them.  A  sample  of  this  table  and  the  cover 
letter  to  test  site  managers  is  provided  in  Appendix  I.  The 
site  managers  were  asked  to  compare  the  information  on 
this  table  to  their  Individual  Control  Forms  and  any 
other  records  maintained  at  the  site.  The  timing  of  this 
exercise  was  important  because,  while  we  wanted  to 
include  as  many  examinees  as  possible,  the  test  sites  still 
had  to  be  operational  and  able  to  resolve  any  discrepan¬ 
cies  discovered.  The  result  of  this  diagnostic  exercise  was 
very encouraging. The only type of discrepancy uncovered
was in cases where the site had just sent data that had
not yet been processed. Because no real errors of omission
were detected and since AT-SAT 2 involved fewer
cases than AT-SAT 1, this diagnostic exercise was not
undertaken for AT-SAT 2.

Further  quality  assurance  measures  were  taken  to 
identify  and  resolve  any  systematic  problems  in  data 
collection  and  transmission.  Under  the  premise  that 
correctly  functioning  test  software  would  produce  files 
that  fall  within  a  certain  byte  size  range  and  that  malfunc¬ 
tioning  software  would  not,  a  diagnostic  program  was 
developed  to  identify  files  that  were  too  small  or  too  big, 
based  on  “normal”  ranges  for  each  test.  The  objective  was 
to  avoid  pervasive  problems  in  the  way  that  the  test 
software  wrote  the  data  by  reviewing  files  with  suspicious 
byte  sizes  as  they  were  received.  To  accomplish  this,  files 
with  anomalous  byte  sizes  and  the  pertinent  DALs  were 
passed  on  to  a  research  analyst  for  review.  A  few  problems 
were  identified  in  this  way.  Most  notably,  we  discovered 
that  the  software  in  the  Scan  predictor  test  stopped 
writing  data  when  the  examinee  did  not  respond  to  test 
items.  Also,  under  some  conditions,  the  Air  Traffic 
Scenarios  test  software  did  not  write  data  as  expected; 
investigation  indicated  that  the  condition  was  rare  and 
that  the  improperly  written  data  could,  in  fact,  be  read 
and  used,  so  the  software  was  not  revised.  No  other 
systematic  problems  in  the  way  the  test  software  wrote 
data  were  identified. 
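
A byte-size screen of the kind described above might look like the following Python sketch. The size ranges, file names, and function name are hypothetical; the report does not give the actual "normal" ranges used for each test.

```python
# Hypothetical "normal" byte-size ranges per test code (assumed values).
NORMAL_SIZE = {
    "SC": (2_000, 50_000),     # Scan
    "AT": (10_000, 400_000),   # Air Traffic Scenarios
}

def flag_anomalous(files):
    """files: iterable of (file_name, test_code, size_in_bytes).

    Returns (file_name, reason) for every file outside its test's range,
    so that it can be passed to an analyst for review.
    """
    flagged = []
    for name, test, size in files:
        low, high = NORMAL_SIZE[test]
        if size < low:
            flagged.append((name, "too small"))
        elif size > high:
            flagged.append((name, "too big"))
    return flagged

# Made-up directory listing.
sample = [
    ("1001SC.DAT", "SC", 812),
    ("1002SC.DAT", "SC", 12_400),
    ("1001AT.DAT", "AT", 950_000),
]
flagged = flag_anomalous(sample)
```

A screen like this looks only at file attributes. It says nothing about whether the contents of a normally sized file are correct.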

This procedure was also one way to identify files with
problems of a more idiosyncratic nature. The identification
of file problems by the data processors was typically
based on improper file name and size attributes. In some
cases,  the  sites  themselves  called  attention  to  problems 
with  files  whose  attributes  were  otherwise  normal.  In 
most  cases,  the  problem  described  by  the  site  involved 
the  use  of  an  incorrect  identification  number  for  an 
examinee  in  the  test  start-up  software.  A  number  of  other 
situations  at  the  test  sites  led  to  problematic  files,  such  as 
when  a  test  administrator  renamed  or  copied  a  file  when 
trying  to  save  an  examinee’s  test  data  in  the  event  of  a 
system  crash.  Very  small  files  or  files  containing  zero 
bytes  would  sometimes  be  written  when  an  administra¬ 
tor  logged  a  participant  onto  a  test  session  and  the 
examinee never showed up for the test. In the first few
weeks of testing, a number of files used by test site
managers to train administrators were erroneously
transmitted to the data processors. It is important
to  note  that  the  contents  of  the  test  files  were  not 
scrutinized  at  this  stage  of  processing. 

The  “stage  two”  processors  recorded  each  problem 
encountered  in  a  Problem  Log  developed  for  this  pur¬ 
pose.  The  test  site  manager  or  administrator  was  then 
contacted  and  the  test  site  and  data  processor  worked 
together  to  identify  the  source  of  the  problem.  This 
approach  was  very  important  because  neglected  system¬ 
atic  data  collection  and  transmittal  issues  could  have  had 
far-reaching  negative  consequences.  Resolution  of  the 
problem  typically  meant  that  the  test  site  would  re¬ 
transmit  the  data,  the  file  name  would  be  changed 
according  to  specific  manager/administrator  instruc¬ 
tions,  or  the  file  would  be  excluded  from  further  process¬ 
ing.  For  each  problem  identified,  stage  two  data  processors 
reached  a  resolution  with  the  test  sites,  and  recorded  that 
resolution  in  the  processor’s  Problem  Log. 

Once  all  of  these  checks  were  made,  data  from  the  test 
sites  were  copied  onto  a  ZIP9  disk.  Weekly  directories  on 
each  ZIP  disk  contained  the  test  files  processed  during  a 
given  week  for  each  stage  two  work  station.  The  data  in 
the  weekly  directories  were  then  passed  on  for  “stage 
three”  processing.  To  ensure  that  only  non-problematic 
files  were  retained  on  the  ZIP  disks  and  that  none  were 
inadvertently  omitted  from  further  processing,  a  weekly 
reconciliation  was  performed  that  compared  all  the  test 
files  processed  during  the  week  (i.e.,  those  copied  to  the 
work  station’s  hard  drive  by  the  Master  Login  procedure) 
to  the  files  written  on  the  week’s  ZIP  disk.  A  computer 


8  This  table  reported  whether  predictor  and  CBPM  test  data,  participant  biographical  information  forms,  and  SSN  Request 
Forms  had  been  received. 

9 ZIP disks are a virtually incorruptible data storage medium that holds up to 100 megabytes of data.

application was written that automatically generated
the  names  of  all  the  discrepant  files  between  these  two 
sources. 

Every  week,  each  stage  two  data  processor  met  with 
the  database  manager  to  discuss  these  discrepancies.  The 
data  processor  had  to  provide  either  a  rationale  for  the 
discrepancy  or  a  resolution.  The  most  typical  rationale 
was  that  the  data  processor  was  “holding  out”  a  file  or 
waiting  for  the  re-issuance  of  a  problem  file  from  the  test 
site.  Meticulous  records  were  kept  of  these  “hold-out” 
files  and  all  were  accounted  for  before  the  testing  periods 
were  completed.  Resolutions  of  discrepancies  typically 
included  deletion  or  addition  of  files  or  changes  to  file 
names.  In  these  cases,  the  database  manager  handled 
resolutions and the reconciliation program was re-executed
to ensure accuracy. These procedures resulted
in a total of 23,107 files10 written onto ZIP disk
at the conclusion of stage two processing for AT-SAT
1 and 2 combined.
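
The weekly reconciliation can be pictured with the short Python sketch below. It is illustrative only: the function name, the file names, and the way "hold-out" files are represented are assumptions, not a description of the original program.

```python
def reconcile(logged_in, on_zip, hold_outs):
    """Compare files copied by the login step with files written to the ZIP disk.

    hold_outs: files a processor is deliberately holding back, for example
    while waiting for a site to re-issue a problem file.
    """
    logged_in, on_zip, hold_outs = set(logged_in), set(on_zip), set(hold_outs)
    not_on_zip = logged_in - on_zip
    return {
        "held_out": sorted(not_on_zip & hold_outs),
        "unexplained_missing": sorted(not_on_zip - hold_outs),
        "unexpected_on_zip": sorted(on_zip - logged_in),
    }

# Made-up week of files.
report = reconcile(
    logged_in=["1001AM.DAT", "1001DI.DAT", "1002AM.DAT", "1003AM.DAT"],
    on_zip=["1001AM.DAT", "1001DI.DAT", "9999AM.DAT"],
    hold_outs=["1002AM.DAT"],
)
```

In this sketch only the last two lists would need a rationale or resolution from the data processor. The held-out files are tracked separately until they are accounted for.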

So as not to waste analysis time during AT-SAT 1, raw
CBPM  test  files  contained  on  weekly  ZIP  disks  were  sent 
to  PDRI  on  a  weekly  basis  during  the  testing  period, 
along  with  the  DALs  and  lists  of  files  with  size  problems. 
During  AT-SAT  2,  CBPM  files  were  sent  to  PDRI  at  the 
end  of  the  testing  period;  DALs  and  DTFs  were  sent  to 
PDRI  directly  from  the  sites.  Similarly,  Analogies  (AN), 
Planes  (PL),  Letter  Factory  (LA),  and  Scan  (SC)  raw  test 
files  were  sent  to  RGI  on  a  weekly  basis  during  AT-SAT 
1  and  at  the  end  of  the  testing  period  for  AT-SAT  2.  At 
the  end  of  the  AT-SAT  1  testing  period,  all  the  collected 
data  for  each  of  these  tests  were  re-transmitted  to  the 
appropriate  organization,  so  that  the  completeness  of  the 
cumulative  weekly  transmittals  could  be  assessed  against 
the  final  complete  transmittal. 

HumRRO  wrote  computer  applications  that  read  the 
raw  files  for  a  number  of  predictor  tests.  These  tests, 
which  contained  multiple  records  per  examinee,  were 
reconfigured  into  ASCII  files  with  a  single  record  for 
each  participant  for  each  test.  SAS  files  were  then  created 
for  each  test  from  these  reconfigured  files.  This  work  was 
performed  for  the  following  tests:  Applied  Math  (AM), 
Dials  (DI),  Memory  1  (ME),  Memory  2  (MR),  Sound 
(SN),  Angles  (AN),  Air  Traffic  Scenarios  (AT),  Time 
Wall  (TW),  and  the  Experience  Questionnaire  (EQ).  At 
the  conclusion  of  testing,  the  reconfigured  EQ  data  were 
sent  to  PDRI  for  scoring  and  analysis. 
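
As a rough illustration of the reconfiguration step, the Python sketch below collapses several raw records per examinee into one delimited record per examinee. The raw layout (examinee, item number, response) and the comma-delimited output are assumptions for the example; the actual raw file formats are not described in this section.

```python
from collections import defaultdict

def one_record_per_examinee(rows):
    """rows: iterable of (examinee_id, item_number, response).

    Returns one comma-delimited line per examinee, with the responses
    in item-number order.
    """
    by_examinee = defaultdict(dict)
    for examinee, item, response in rows:
        by_examinee[examinee][item] = response

    lines = []
    for examinee in sorted(by_examinee):
        items = by_examinee[examinee]
        values = [str(items[i]) for i in sorted(items)]
        lines.append(",".join([examinee] + values))
    return lines

# Made-up raw records: two items for each of two examinees.
raw = [("1001", 1, "A"), ("1002", 1, "B"), ("1001", 2, "C"), ("1002", 2, "D")]
flat = one_record_per_examinee(raw)
```

Each output line could then be read into a statistical package as a single observation.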


Hard  Copy  Data 

Data  Handling  of  Participant  Biographical  Data 
and  Request  for  SSN  Forms.  As  mentioned  above,  stage 
one  processors  handled  the  data  transmittal  packages 
from  the  test  sites.  Once  each  hard  copy  form  had  been 
date  stamped,  these  processors  passed  the  participant 
biographical  forms  and  SSN  Request  Forms  to  stage  two 
processors.  Here,  as  in  the  processing  of  automated  test 
data,  to  ensure  that  all  the  data  indicated  on  the  DTF  had 
been  sent,  a  report  printed  by  the  DTF  automation 
program  listed  all  the  hard  copy  participant  forms  that 
the  DTF  indicated  should  be  present  for  an  examinee. 
The  stage  two  data  processors  were  then  required  to  find 
the  hard  copy  form  and  place  a  check  mark  in  the  space 
provided  by  the  reporting  program.  As  with  the  auto¬ 
mated  test  data,  all  problems  were  recorded  in  the  data 
processor’s  Problem  Log  and  the  test  sites  were  contacted 
for  problem  resolution. 

Data Handling of Assessor Biographical Data and
Criterion Assessment Rating Sheets. As discussed earlier,
the automated DTF file contained information
recorded  on  the  first  page  of  the  DTF  form  describing 
the  participant  data  transmitted  from  the  site.  The 
second  page  of  the  hard  copy  DTF  contained  informa¬ 
tion  on  assessor  data — specifically,  whether  a  Confiden¬ 
tial  Envelope,  which  contained  the  Criterion  Rating 
Assessment  Sheet(s)  (CARS),  and  an  Assessor  Biographi¬ 
cal  Form  were  present  in  the  data  transmittal  package. 
HumRRO  handled  assessor  biographical  data  and  the 
Criterion  Rating  Assessment  Sheets  during  AT-SAT  1; 
these  hard  copy  instruments  were  processed  by  PDRI 
during  AT-SAT  2.  As  with  other  types  of  data,  to  ensure 
that  all  collected  assessor  information  was  actually  trans¬ 
mitted,  stage  one  processors  compared  the  assessor  data 
contained  in  each  data  transmittal  package  to  the  infor¬ 
mation  contained  on  the  DTF.  Test  sites  were  informed 
of  all  discrepancies  by  e-mailed  memoranda  or  telephone 
communication  and  were  asked  to  provide  a  resolution 
for  each  discrepancy.  Because  the  assessors  were  often 
asked  to  provide  CARS  ratings  and  complete  the  Asses¬ 
sor  Biographical  Data  Form  at  the  same  time,  they  often 
included  the  biographical  form  in  the  Confidential 
Envelope  along  with  the  CARS.  As  a  consequence,  the 
test  site  administrator  did  not  have  first-hand  knowledge 
of  which  forms  were  contained  in  the  envelopes.  In 
processing  the  hard  copy  assessor  data,  there  were  a  total 


10 The 23,107 files comprised the CBPM test, the 13 predictor tests, and one start-up (ST) file for controller
examinees, and the 13 predictor tests and one start-up (ST) file for pseudo-applicants.

of 29 assessor discrepancies11 between the data actually
received  and  the  data  the  DTF  indicated  should  have 
been  received.  Of  these  29,  only  four  discrepancies  could 
not  be  resolved.  In  these  instances  the  assessor  simply 
may  not  have  included  in  the  Confidential  Envelope  the 
forms  that  the  administrator  thought  were  included. 

Data  Automation.  Hard  copy  forms  that  passed 
through  to  stage  two  processing  were  photocopied  and 
the  originals  filed  awaiting  automation.  Since  there  were 
no  other  copies  of  these  data,  photocopies  insured  against 
their  irrevocable  loss,  particularly  once  they  were  sent  to 
key-punch.  All  original  and  photocopied  Request  for 
SSN  Forms  were  stored  in  a  locked  cabinet.  Five  separate 
ASCII  key  entry  specifications  were  developed  by  the 
AT-SAT  database  manager:  for  the  three  biographical 
data  instruments,  the  CARS  form,  and  the  Request  for 
SSN  Form.  The  database  manager  worked  closely  with 
the  data  automation  company  chosen  to  key  enter  the 
data.  The  data  were  double-keyed  to  ensure  accuracy. 
Once the data were keyed and returned, the total number
of cases key entered was verified against the total number
of hard copy forms sent to key-punch. Data were sent
to key-punch in three installments during the course of
AT-SAT 1 testing; a small fourth installment consisting
of last-minute “stragglers” was keyed in-house. CARS and
assessor biographical AT-SAT 2 data were sent to key-punch
in two installments during testing and a small
third  installment  of  “stragglers”  was  keyed  in-house  by 
PDRI.  In  AT-SAT  1,  automated  files  containing  asses¬ 
sor  and  participant  biographical  data  and  criterion  rat¬ 
ings  data  were  sent  to  PDRI  a  few  times  during  the  course 
of  testing;  complete  datasets  were  transmitted  when 
testing  was  concluded. 

Historical  Data 

Confidentiality  of  test  participants  was  a  primary 
concern  in  developing  a  strategy  for  obtaining  historical 
data  from  the  FAA  computer  archives  and  linking  that 
data  to  other  AT-SAT  datasets.  Specifically,  the  objec¬ 
tive  was  to  ensure  that  the  link  between  test  examinees 
and  controllers  was  not  revealed  to  the  FAA,  so  that  test 
results  could  never  be  associated  with  a  particular  em¬ 
ployee.  Also,  although  the  FAA  needed  participant  con¬ 
troller  Social  Security  Numbers  (SSN)  to  identify  and 
extract cases from their historical archives, these SSNs
could not be returned once the historical information
had  been  extracted.  Therefore,  examinee  number  or  SSN 
could  not  be  used  as  the  link  between  records  in  the 
historical  data  and  the  other  AT-SAT  data  collected.  To 
overcome  this  problem,  a  unique  random  identification 
number  was  generated  for  each  controller  examinee  who 
submitted  a  Request  for  SSN  form  in  AT-SAT  1  and  2. 
Electronic  files  containing  the  SSN,  this  random  identi¬ 
fication  number,  and  site  number  were  sent  to  the  FAA. 
Of  the  986  controllers  who  submitted  a  Request  for  SSN 
Form,  967  had  non-missing  SSNs  that  could  be  linked 
to  the  FAA  archival  data.  In  addition  to  these  967  SSNs, 
the  FAA  received  4  SSN  Forms  during  the  high  fidelity 
testing  in  Oklahoma  City,  which  increased  the  number 
of  cases  with  historical  data  to  971. 
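
The linking scheme described above can be sketched as follows. This Python is a minimal illustration under stated assumptions: the field names, the size of the identifier space, and the use of the `secrets` module are choices made for the example, and the SSNs shown are placeholders. The report does not say how the random numbers were actually generated.

```python
import secrets

def assign_link_ids(controllers, id_space=10**8):
    """controllers: list of dicts with 'examinee', 'ssn', and 'site'.

    Returns (crosswalk, outgoing). The crosswalk (link_id -> examinee)
    stays with the research team. Only `outgoing` is sent to the archive
    holder, so the examinee number never leaves with the SSN.
    """
    crosswalk, outgoing = {}, []
    for person in controllers:
        if not person["ssn"]:
            continue  # a missing SSN cannot be matched to the archive
        link_id = secrets.randbelow(id_space)
        while link_id in crosswalk:  # redraw on the rare collision
            link_id = secrets.randbelow(id_space)
        crosswalk[link_id] = person["examinee"]
        outgoing.append(
            {"ssn": person["ssn"], "link_id": link_id, "site": person["site"]}
        )
    return crosswalk, outgoing

# Placeholder records; the second has no usable SSN.
people = [
    {"examinee": "1001", "ssn": "000-00-0001", "site": "01"},
    {"examinee": "1002", "ssn": None, "site": "01"},
    {"examinee": "1003", "ssn": "000-00-0003", "site": "02"},
]
crosswalk, outgoing = assign_link_ids(people)
```

Under this design the extracted historical records come back keyed only by the random identifier, and the crosswalk is used to merge them with the other datasets.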

Pseudo-Applicant  ASVAB  Data 

AFQT  scores  and  composite  measures  of  ASVAB 
subtests G (General), A (Administrative), M (Mechanical),
and E (Electronic) were obtained for Kessler pseudo-applicants
and merged with test and biographical data
during  stage  three  data  processing. 

Integration  of  AT-SAT  Data 

The  goal  in  designing  the  final  AT-SAT  database  was 
to  create  a  main  dataset  that  could  be  used  to  address 
most  analytic  needs,  with  satellite  datasets  providing 
more  detailed  information  in  specific  areas.  Before  the 
database  could  be  created,  data  processors  needed  to 
perform  diagnostic  assessments  of  the  accuracy  of  the 
data  and  edit  the  data  on  the  basis  of  those  assessments. 
“Stage three” data processing activities included these
diagnostic data checks and edits, as well as data merging
and archiving.

Data  Diagnostics  and  Edits 

Since  the  data  contained  on  the  test  files  were  written 
by  test  software  that  was  generally  performing  as  ex¬ 
pected,  there  were  no  errors  in  data  recordation,  and 
therefore  no  need  for  large-scale  data  editing.  There  were 
two  types  of  diagnostic  checks  to  which  the  test  files  were 
subjected,  however.  First,  a  check  was  made  to  see 
whether  an  examinee  had  taken  the  same  test  more  than 
once.  It  is  a  testament  to  the  diligent  work  of  the  test  sites 
and  the  data  processors  that  this  anomaly  was  not  evident 


11  The  total  number  of  assessor  discrepancies  e-mailed  to  sites  was  41.  For  12  participant  assessors,  the  test  administrator 
indicated  the  presence  of  an  assessor  biographical  form  on  the  DTF  when  a  participant  biographical  form  had  actually  been 
completed. Therefore, the number of true assessor discrepancies was 29.

in the data. Second, the test analysts performed diagnostics
to identify observations that might be excluded from
further  analysis,  such  as  those  examinees  exhibiting 
motivational  problems.  Obviously,  historical  data  from 
the  FAA  archives  were  not  edited.  Data  collected  on  hard 
copy  instruments  were  subjected  to  numerous  internal 
and  external  diagnostic  and  consistency  checks  and 
programmatic  data  editing.  A  primary  goal  in  data 
editing  was  to  salvage  as  much  of  the  data  as  possible 
without  jeopardizing  accuracy. 

Participant  Biographical  Data.  Several  different  types 
of  problems  were  encountered  with  the  participant  bio¬ 
graphical  data: 

•  More  than  one  biographical  information  form  com¬ 
pleted  by  the  same  participant 

•  Missing  or  out-of-range  examinee  identification  number 

•  Out-of-range  date  values 

First,  to  correct  the  problem  of  duplicate12  biographi¬ 
cal  forms  for  the  same  examinee,  all  forms  completed 
after  the  first  were  deleted.  Second,  information  from  the 
DTF  sent  with  the  biographical  form  often  made  it 
possible  to  identify  missing  examinee  numbers  through 
a  process  of  elimination.  Investigation  of  some  out-of- 
range  examinee  numbers  revealed  that  the  digits  had 
been  transposed  at  the  test  site.  Third,  out-of-range  date 
values  were  either  edited  to  the  known  correct  value  or  set 
to  missing  when  the  correct  value  was  unknown. 
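
Two of the edit rules above, keeping only the first form per examinee and handling out-of-range dates, can be sketched in Python as follows. The field names, the availability of a completion date for ordering the forms, and the valid year range are assumptions made for illustration.

```python
def keep_first_form(forms):
    """Keep each examinee's earliest form.

    forms: list of dicts with 'examinee' and 'completed' (ISO date string).
    """
    kept = {}
    for form in sorted(forms, key=lambda f: f["completed"]):
        kept.setdefault(form["examinee"], form)  # later forms are ignored
    return list(kept.values())

def clean_year(year, valid_years, known_value=None):
    """Return the year if it is in range; otherwise return the known
    correct value, or None (missing) when no correct value is known."""
    if year in valid_years:
        return year
    return known_value

# Made-up forms: examinee 1001 completed two.
forms = [
    {"examinee": "1001", "completed": "1997-06-12", "form": "B"},
    {"examinee": "1001", "completed": "1997-05-20", "form": "A"},
    {"examinee": "1002", "completed": "1997-06-01", "form": "C"},
]
kept = keep_first_form(forms)
```

Treating a missing value as `None` keeps an unknown date distinct from any real year.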

Other  data  edits  were  performed  on  the  controller  and 
pseudo-applicant  participant  biographical  data.  A  num¬ 
ber  of  examinees  addressed  the  question  of  racial/ethnic 
background  by  responding  “Other”  and  provided  open- 
ended  information  in  the  space  allowed.  In  many  cases, 
the  group  affiliation  specified  in  the  open-ended  re¬ 
sponse  could  be  re-coded  to  one  of  the  five  specific 
alternatives  provided  by  the  item  (i.e.,  Native  American/ 
Alaskan  Native,  Asian/Pacific  Islander,  African  Ameri¬ 
can,  Hispanic,  or  Non-Minority).  In  these  cases,  the 
open-ended  responses  were  recoded  to  one  of  the  close- 
ended  item  alternatives.  In  other  cases,  a  sixth  racial 
category,  mixed  race,  was  created  and  applicable  open- 
ended  responses  were  coded  as  such. 

Two  types  of  edits  were  applicable  only  to  the  control¬ 
ler  sample.  First,  in  biographical  items  that  dealt  with  the 
length of time (months and years) that the controller had
been performing various duties, when only the month or
year component was missing, the missing item was
coded as zero. Also, for consistency, full years were always
recorded in the year field rather than the month field.
When full years were reported in the month field (e.g.,
24 months), the year field was incremented by the appropriate
amount and the month field re-coded to reflect any
remaining time less than a year.
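
The time-field edit amounts to a small normalization, sketched here in Python. The function name is hypothetical, and leaving a duration missing when both components are missing is an assumption; the text specifies only the case where one component is missing.

```python
def normalize_duration(years, months):
    """Return (years, months) with full years moved out of the month field.

    A single missing component (None) is coded as zero. If both are
    missing, the duration stays missing (an assumption of this sketch).
    """
    if years is None and months is None:
        return None, None
    years = years or 0
    months = months or 0
    return years + months // 12, months % 12

examples = [
    normalize_duration(None, 24),   # years reported only as months
    normalize_duration(3, 14),      # partial overflow into years
    normalize_duration(5, None),    # month component missing
]
```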

Second,  a  suspiciously  large  group  of  controller  par¬ 
ticipants  reported  their  race  as  “Native  American/Alas¬ 
kan  Native”  on  the  biographical  form.  To  check  the 
accuracy  of  self-reported  race,  the  responses  were  com¬ 
pared  to  the  race/ethnic  variable  on  the  historical  FAA 
archive  data.  For  those  controllers  with  historical  data, 
racial  affiliation  from  the  FAA  archives  was  used  rather 
than  self-reported  race  as  a  final  indication  of  controller 
race. The following frequencies of race from these two
sources of information show some of the discrepancies
(Source 1 represents self-reported race from the biographical
form only; Source 2 represents archival race when
available and self-reported race when it was not).
Using Source 1, there were 77 Native American/Alaskan
Native participants, compared to 23 using Source 2.
Similarly, there were 9 and 7 Asian/Pacific Islander
participants, respectively (Source 1 is always given first),
95 and 98 African American, 64 and 61 Hispanic, 804 and
890 Non-Minority, 20 and 8 Other, and 4 and 1 Mixed Race.
This gives a total of 1073 participants by Source 1 and
1088 by Source 2, with 159 cases missing Source 1 data and
144 missing Source 2 data. (Counts for Other were produced
after “Other” was re-coded into one of the five close-ended
item alternatives whenever possible.)
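
The Source 2 rule and the totals quoted above can be checked with a few lines of Python. The function name is hypothetical; the counts are those given in the text.

```python
def final_race(self_reported, archival):
    """Source 2 rule: archival race when available, otherwise self-report."""
    return archival if archival is not None else self_reported

# Frequencies quoted in the text, as (Source 1, Source 2).
counts = {
    "Native American/Alaskan Native": (77, 23),
    "Asian/Pacific Islander": (9, 7),
    "African American": (95, 98),
    "Hispanic": (64, 61),
    "Non-Minority": (804, 890),
    "Other": (20, 8),
    "Mixed Race": (4, 1),
}
total_1 = sum(s1 for s1, _ in counts.values())
total_2 = sum(s2 for _, s2 in counts.values())
```

The category counts sum to the reported totals of 1073 and 1088. Adding the missing cases (159 and 144, respectively) gives the same overall number of participants, 1232, under both sources.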

All  edits  were  performed  programmatically,  with  hard 
copy  documentation  supporting  each  edit  maintained  in 
a  separate  log.  In  33  cases,  participant  assessors  com¬ 
pleted  only  assessor  rather  than  participant  biographical 
forms.  In  these  cases,  biographical  information  from  the 
assessor  form  was  used  for  participants. 

Assessor  Biographical  Data.  Like  the  participant 
data,  the  assessor  biographical  data  required  substantial  data 
cleaning.  The  problems  encountered  were  as  follows: 

•  More  than  one  biographical  information  form  com¬ 
pleted  by  the  same  assessor 

•  Incorrect  assessor  identification  numbers 

•  Out-of-range  date  values 


12  The  word  “duplicate”  here  does  not  necessarily  mean  identical,  but  simply  that  more  than  one  form  was  completed  by  a 
single participant. More often than not, the “duplicate” forms completed by the same participant were not identical.

First, the same rule formulated for participants, deleting
all duplicate biographical records completed after the
first,  was  applied.  Second,  by  consulting  the  site  Master 
Rosters  and  other  materials,  misassigned  or  miskeyed13 
rater  identification  numbers  could  be  corrected.  Third, 
out-of-range  date  values  were  either  edited  to  the  known 
correct  value  (i.e.,  the  year  that  all  biographical  forms 
were  completed  was  1997)  or  set  to  missing  when  the 
correct  value  was  unknown. 

In  addition  to  data  corrections,  the  race  and  time 
fields  in  the  assessor  data  were  edited  following  the 
procedures  established  in  the  participant  biographical 
data.  Open-ended  responses  to  the  racial/ethnic  back¬ 
ground  item  were  re-coded  to  a  close-ended  alternative 
whenever  possible.  In  addition,  when  only  the  month  or 
year  component  in  the  “time”  fields  was  missing,  the 
missing  item  was  coded  as  zero.  When  full  years  were 
reported  in  the  month  field  (e.g.,  24  months),  the  year 
field  was  incremented  by  the  appropriate  amount  and 
the  month  field  re-coded  to  reflect  any  remaining  time 
less  than  a  year. 

Since  the  test  sites  were  instructed  to  give  participants 
who  were  also  assessors  a  participant,  rather  than  asses¬ 
sor,  biographical  form,  data  processors  also  looked  for 
biographical  information  on  raters  among  the  partici¬ 
pant  data.  Specifically,  if  an  assessor  who  provided  a 
CARS  for  at  least  one  participant  did  not  have  an  assessor 
biographical  form,  participant  biographical  data  for  that 
assessor were used, when available.

Criterion  Ratings  Data.  Of  all  the  hard  copy  data 
collected,  the  CARS  data  required  the  most  extensive 
data checking and editing. Numerous consistency checks
were performed within the CARS dataset itself (e.g.,
duplicate rater/ratee combinations), and its consistency
with other datasets (e.g., assessor biographical data) was
also assessed. All edits were performed programmatically,
with hard copy documentation supporting each
edit  maintained  in  a  separate  log.  The  following  types  of 
problems  were  encountered: 

•  Missing  or  incorrect  examinee/rater  numbers 

•  Missing  rater/ratee  relationship 

•  Duplicate  rater/ratee  combinations 

•  Rater/ratee  pairs  with  missing  or  outlier  ratings  or 
involved  in  severe  DAL  entries 

•  Out-of-range  date  values 


First, the vast majority of missing or incorrect
identification numbers and/or rater/ratee relationships were
corrected by referring back to the hard copy source and/or
other records. In some cases the test site manager was
contacted  for  assistance.  Since  the  goal  was  to  salvage  as 
much  data  as  possible,  examinee/rater  numbers  were 
filled  in  or  corrected  whenever  possible  by  using  records 
maintained  at  the  sites,  such  as  the  Master  Roster. 
Problems  with  identification  numbers  often  originated 
in  the  field,  although  some  key-punch  errors  occurred 
despite  the  double-key  procedure.  Since  examinee  num¬ 
ber  on  a  CARS  record  was  essential  for  analytic  purposes, 
six  cases  were  deleted  where  examinee  number  was  still 
unknown  after  all  avenues  of  information  had  been 
exhausted. 

Second,  some  raters  provided  ratings  for  the  same 
examinee  more  than  once,  producing  records  with  dupli¬ 
cate  rater/ratee  combinations.  In  these  cases,  hard  copy 
sources  were  reviewed  to  determine  which  rating  sheet 
the  rater  had  completed  first;  all  ratings  produced 
subsequently  for  that  particular  rater/ratee  combina¬ 
tion  were  deleted. 
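
The duplicate rule can be sketched as follows. The field names (`rater_id`, `ratee_id`, `completed_order`) are hypothetical; in the study the completion order was established by reviewing the hard copy rating sheets, not from a field in the data.

```python
def keep_first_ratings(records):
    """Keep only the earliest record for each rater/ratee combination.

    `completed_order` is a hypothetical field standing in for the order
    determined from the hard copy sources.
    """
    first = {}
    for rec in sorted(records, key=lambda r: r["completed_order"]):
        key = (rec["rater_id"], rec["ratee_id"])
        # setdefault stores the record only if the pair has not been seen.
        first.setdefault(key, rec)
    return list(first.values())


records = [
    {"rater_id": "R1", "ratee_id": "E1", "completed_order": 2, "rating": 5},
    {"rater_id": "R1", "ratee_id": "E1", "completed_order": 1, "rating": 4},
    {"rater_id": "R2", "ratee_id": "E1", "completed_order": 1, "rating": 6},
]
kept = keep_first_ratings(records)
```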

Third,  some  cases  were  deleted  based  on  specific 
direction  from  data  analysts  once  the  data  had  been 
scrutinized.  These  included  rater/ratee  combinations 
with more than 3 of the 11 rating dimensions missing,
outlier ratings, ratings dropped due to information in the
Problem Logs, or incorrect assignment of raters to ratees
(e.g., raters who had not observed ratees controlling
traffic). Fourth, CARS items that dealt with the length of
time  (months  and  years)  that  the  rater  had  worked  with 
the  ratee  were  edited,  so  that  when  only  the  month  or 
year  component  was  missing,  the  missing  item  was 
coded  as  zero.  Where  full  years  were  reported  in  the 
month  field,  the  year  field  was  incremented  and  the 
month  field  re-coded  to  reflect  any  remaining  time. 
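
The missing-data deletion rule (more than 3 of the 11 rating dimensions missing) can be expressed as a simple filter. This is a sketch only: representing a missing rating as `None` and the function names are assumptions, and the other deletion criteria (outliers, Problem Log entries, misassigned raters) required analyst judgment and are not modeled here.

```python
N_DIMENSIONS = 11   # rating dimensions on each CARS record
MAX_MISSING = 3     # records with more than this many missing are dropped


def has_too_many_missing(ratings):
    """Return True if more than MAX_MISSING of the ratings are missing."""
    if len(ratings) != N_DIMENSIONS:
        raise ValueError("expected one rating per dimension")
    missing = sum(1 for value in ratings if value is None)
    return missing > MAX_MISSING


def drop_incomplete(records):
    """Keep only records that pass the missing-data rule."""
    return [rec for rec in records if not has_too_many_missing(rec)]


complete = [5, 6, 4, 5, 7, 6, 5, 4, 6, 5, 5]
three_missing = [None, None, None, 5, 7, 6, 5, 4, 6, 5, 5]
four_missing = [None, None, None, None, 7, 6, 5, 4, 6, 5, 5]
retained = drop_incomplete([complete, three_missing, four_missing])
```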

AT-SAT  Database 

As stated above, the database management plan called
for a main AT-SAT dataset that could address most
analytic needs, with satellite datasets that could provide
detailed information in specific areas. The AT-SAT
Database, containing data from the alpha and beta tests,
is presented in Figure 5.3.1. To avoid redundancy,
datasets that are completely contained within other
datasets are not presented separately in the AT-SAT
Database. For example, since participant biographical
data are completely contained in the final summary dataset,
they are not provided as a separate satellite dataset in the
AT-SAT Database. Similarly, since the rater biographical
data contain all the data recorded on the assessor
biographical form, as well as some participant forms, the
assessor biographical form is not listed as a separate
dataset in the AT-SAT Database. All data processing for
the AT-SAT Database was done in the Statistical Analysis
System (SAS). The datasets contained in the archived
AT-SAT Database were stored as portable Statistical
Package for the Social Sciences (SPSS) files.

13 The miskeying was often the result of illegible handwriting on the hard copy forms.

Alpha  Data.  The  Alpha  data  consist  of  a  summary 
dataset  as  well  as  scored  item  level  test  data  from  the 
Pensacola study conducted in the spring of 1997. Scored
test  data  and  biographical  information  are  stored  in  the 
summary  dataset  called  “SUMMARY.POR”.  Item  level 
scored  test  data  are  contained  in  14  individual  files 
named  “xx_ITEMS.POR”,  where  xx  is  the  predictor  test 
acronym; an additional 15th file called AS_ITEMS.POR
contains  ASVAB  test  scores. 

Beta  Test  Data.  The  Final  Analytic  Summary  Data 
file in the AT-SAT database comprises a number of
different  types  of  data: 

•  Subset  of  scored  test  variables 

•  Complete  historical  FAA  archive  data 

•  Participant  biographical  information 

•  ASVAB  data  for  Keesler  participants 

•  Information  on  rater  identification  numbers 

As  stated  previously,  HumRRO,  RGI,  and  PDRI 
were  each  responsible  for  developing  and  analyzing 
specific  tests  in  the  beta  test  battery.  The  results  of  these 
analyses  are  presented  in  detail  elsewhere  in  this  report. 
Once  the  tests  had  been  scored,  each  organization  re¬ 
turned  the  scored  item-level  data  to  the  AT-SAT  data¬ 
base  manager.  Salient  scored  variables  were  extracted 
from  each  of  these  files  and  were  linked  together  by 
examinee  number.  This  created  an  examinee-level  dataset 
with  a  single  record  containing  test  information  for  each 
examinee.  Participant  biographical  data  and  historical 
FAA  archive  data  were  merged  to  this  record,  also  by 
examinee number. For Keesler pseudo-applicants, ASVAB
data were added. Participants for whom at least one CARS
had been completed also had variable(s) appended to their
main record containing the identification number of their
assessor(s), so that examinee-level and assessor-level
data can be easily linked. Test variable names always
begin with the two letter test acronym; the names of
biographical items in this data file begin with “BI”.
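
The linking step described above amounts to merging several sources onto one record per examinee. The following sketch shows the idea with plain dictionaries keyed by examinee number. All variable names (`AM_TOTAL`, `BI_AGE`, `RATER1`) are hypothetical, and the actual merge was carried out in SAS.

```python
def build_summary(test_scores, biographical, rater_ids):
    """Build one record per examinee, keyed by examinee number.

    Test scores form the base record; biographical fields and rater
    identification numbers are added where they exist for that examinee.
    """
    summary = {}
    for examinee, scores in test_scores.items():
        record = dict(scores)
        record.update(biographical.get(examinee, {}))
        # Append one variable per assessor who completed a CARS.
        for i, rater in enumerate(rater_ids.get(examinee, []), start=1):
            record[f"RATER{i}"] = rater
        summary[examinee] = record
    return summary


tests = {"E001": {"AM_TOTAL": 21}, "E002": {"AM_TOTAL": 17}}
bio = {"E001": {"BI_AGE": 34}}
raters = {"E001": ["R10", "R11"]}
summary = build_summary(tests, bio, raters)
```

Examinees without biographical data or ratings simply keep their test variables, which mirrors a left join on examinee number.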

This  main  analysis  dataset  is  called  XFINDAT5.POR 
and  contains  1,752  cases  with  1,466  variables.14 

The  satellite  test  and  rating  data  in  the  AT-SAT 
Database  are  comprised  of  three  types  of  files.  The  first 
group  consists  of  the  23,107  raw  ASCII  examinee  test 
(predictor  and  CBPM)  files  stored  in  weekly  data  pro¬ 
cessing  directories.  The  processing  of  these  data  is  de¬ 
scribed  in  the  subsection,  Initial  Data  Processing, 
Automated  Test  Files.  These  raw  files  are  included  in  the 
AT-SAT  Database  primarily  for  archival  purposes.  Sec¬ 
ond,  there  is  the  electronic  edited  version  of  the  CARS 
hard  copy  data,  called  CAR.POR,  which  is  described  in 
the  subsection,  Initial  Data  Processing,  Hard  Copy 
Data.  This  file  is  also  included  in  the  AT-SAT  Database 
mainly  for  purposes  of  data  archive.  The  third  group  of 
files  contains  complete  scored  item-level  test  data  for 
examinees,  derived  from  the  first  two  types  of  data  files 
listed  above.  The  predictor  scored  item-level  files  (e.g., 
EQ_ITEM.POR, AM_ITEMS.POR) were derived from
the  raw  ASCII  predictor  test  files;  the  criterion  file 
(CR_ITEMS.POR)  was  derived  from  raw  CBPM  test 
files  and  the  CAR  data.15  Salient  variables  from  these 
scored  item-level  test  files  constitute  the  test  data  in  the 
analytic  summary  file  XFINDAT5.POR. 

Biographical  Data  were  also  included  in  the  beta  test 
datasets.  Complete  examinee  biographical  data  are  con¬ 
tained  in  the  analytic  summary  file  XFINDAT5.POR 
and  are,  therefore,  not  provided  as  a  separate  file  in  the 
database.  Biographical  information  on  assessors  only 
and  participant  assessors  is  contained  in  the  dataset 
called  XBRATER.POR  and  is  described  in  the  subsec¬ 
tion,  Initial  Data  Processing,  Hard  Copy  Data. 

14 The following FAA-supplied alphanumeric variables were assigned an SPSS system missing value when the original value
consisted of a blank string: CFAC, FAC, FORM, IOPT, OPT, ROPT, STATSPEC, TTYPE, and CDATE. The following
FAA-supplied variables were dropped since they contained missing values for all cases: REG, DATECLRD, EOD,
FAIL16PF, P_P, and YR.

15 This file also contains scored High Fidelity test data.

Data Archive. The AT-SAT database described above
is archived on CD-ROM. Figure 5.3.2 outlines the
directory structure for the AT-SAT CD-ROM data
archive. The root directory contains a README.TXT
file that provides a brief description of the contents; it also
contains two subdirectories. The first subdirectory
contains Alpha data, while the second contains data for the
Beta analysis. Within the Alpha subdirectory, there are
two subdirectories, “Final Summary Data” and “Examinee
Item Level Scored Data”, each of which contains data
files. The Beta subdirectory contains the following
subdirectories:

•  Edited  Criterion  Assessment  Rating  Sheets 

•  Edited  Rater  Biodata  Forms 

•  Examinee  Item  Level  Scored  Test  Data 

•  Final  Analytic  Summary  Data 

•  Raw  Examinee  Test  Data  in  Weekly  Subdirectories 

•  Scaled,  Imputed,  and  Standardized  Test  Scores 


Each  Beta  subdirectory  contains  data  files.  In  addi¬ 
tion,  the  “Final  Analytic  Summary  Data”  subdirectory 
contains  a  codebook  for  XFINDAT5.POR.  The 
codebook  consists  of  two  volumes  that  are  stored  as 
Microsoft Word files CBK1.DOC and CBK2.DOC.
The  CBK1.DOC  file  contains  variable  information 
generated  from  an  SPSS  SYSFILE  INFO.  It  also  con¬ 
tains  a  Table  of  Contents  to  the  SYSFILE  INFO  for  ease 
of  reference.  The  CBK2.DOC  file  contains  frequency 
distributions  for  discrete  variables,  means  for  continu¬ 
ous  data  elements,  and  a  Table  of  Contents  to  these 
descriptive  statistics.16 


16 Means were generated on numeric FAA-generated historical variables unless they were clearly discrete.


CHAPTER  5.4 


Biographical  and  Computer  Experience  Information: 
Demographics  for  the  Validation  Study 

Patricia  A.  Keenan,  HumRRO 


This chapter begins with the demographic
characteristics of the participants in both the concurrent
validation and the pseudo-applicant samples. The data on
the controller sample are presented first, followed by the
pseudo-applicant information. The latter data are divided
between civilian and military participants. It should be
noted that not all participants answered each question in
the biographical information form, so at times the
numbers will vary or cumulative counts may not total 100%.

TOTAL  SAMPLE 

Participant  Demographics 

A  total  of  1,752  individuals  took  part  in  the  study 
(incumbents  and  pseudo-applicants);  1,265  of  the  par¬ 
ticipants  were  male  (72.2%)  and  342  were  female 
(19.5%). 145 participants did not indicate their gender;
149  did  not  identify  their  ethnicity.  The  cross-tabula¬ 
tion  of  ethnicity  and  gender,  presented  in  Table  5.4.1, 
represents  only  those  individuals  who  provided  com¬ 
plete  information  about  both  their  race  and  gender. 

The  sample  included  incumbent  FAA  controllers, 
supervisors  and  staff  (Controller  sample)  as  well  as 
pseudo-applicants from Keesler Air Force Base (Military
PA  sample)  and  civilian  volunteers  from  across  the 
country  (Civilian  PA  sample).  The  pseudo-applicants 
were  selected  based  on  demographic  similarity  to  ex¬ 
pected  applicants  to  the  controller  position.  The  esti¬ 
mated  average  age  of  the  total  sample  was  33.14  years 
(SD  =  8.43).  Ages  ranged  from  18  to  60  years.  This 
number  was  calculated  based  on  the  information  from 
1,583 participants; 169 people did not provide information
about their date of birth and were not included in
this  average. 

Participants  were  asked  to  identify  the  highest  level  of 
education  they  had  received.  Table  5.4.2  presents  a 
breakdown  of  the  educational  experience  for  all  partici¬ 
pants.  (151  people  did  not  provide  information  about 
their  educational  background.)  The  data  were  collected 
at  18  locations  around  the  U.S.  Table  5.4.3  shows  the 
number  of  participants  who  tested  at  each  facility. 


CONTROLLER  SAMPLE 

Participant  Demographics 

A total of 1,232 FAA air traffic controllers took part
in the concurrent validation study. 912 controllers were
male (83.7%) and 177 were female (16.3%). 143
participants did not specify their gender, so their
participation is not reflected in analyses. The majority of the
data were collected in 1997. A supplementary data
collection was conducted in 1998 to increase the minority
representation in the sample. A total of 1,081 controllers
participated in the 1997 data collection; 151 additional
controllers participated in 1998. Table 5.4.4 shows the
cross-tabulation of race and gender for the
1997 and 1998 samples, as well as the combined
numbers across both years. 143 individuals did not report
their gender and 144 did not report their race. These
individuals are not reflected in Table 5.4.4. The average
age of the controllers was 37.47 (SD = 5.98), with ages
ranging from 25 to 60 years. The mean was based on
information provided by 1,079 of the participants; age
could not be calculated for 153 participants.
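
The gender percentages reported for the total sample and for the controller sample appear to use different bases, which a short arithmetic check makes visible. The counts below are taken from the text; the inference about which denominator was used is an assumption based on which base reproduces the printed percentages.

```python
def pct(count, base):
    """Percentage of `base`, rounded to one decimal place."""
    return round(100 * count / base, 1)


# Total sample: percentages match a base of all 1,752 participants.
total_n = 1752
total_male, total_female, total_missing = 1265, 342, 145
assert total_male + total_female + total_missing == total_n
print(pct(total_male, total_n), pct(total_female, total_n))   # 72.2 19.5

# Controller sample: percentages match a base of the 1,089 controllers
# who reported gender (912 + 177), not all 1,232 controllers.
ctrl_n = 1232
ctrl_male, ctrl_female, ctrl_missing = 912, 177, 143
ctrl_reported = ctrl_male + ctrl_female
assert ctrl_reported + ctrl_missing == ctrl_n
print(pct(ctrl_male, ctrl_reported), pct(ctrl_female, ctrl_reported))  # 83.7 16.3
```

This is why the total-sample percentages do not sum to 100% while the controller percentages do.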

Also  of  interest  was  the  educational  background  of  the 
controllers.  Table  5.4.5  shows  the  highest  level  of  educa¬ 
tion  achieved  by  the  respondents.  No  information  on 
education  was  provided  by  145  controllers. 

Professional  Experience 

The  controllers  represented  17  enroute  facilities.  The 
locations  of  the  facilities  and  the  number  of  controller 
participants  at  each  one  are  shown  in  Table  5.4.6.  A  total 
of  1,218  controllers  identified  the  facility  at  which  they 
are  assigned;  14  did  not  identify  their  facility. 

One  goal  of  the  study  was  to  have  a  sample  composed 
of  a  large  majority  of  individuals  with  air  traffic  experi¬ 
ence,  as  opposed  to  supervisors  or  staff  personnel.  For 
this  reason,  participants  were  asked  to  identify  both  their 
current  and  previous  positions.  This  would  allow  us  to 
identify everyone who had current or previous
experience in air traffic control. Table 5.4.7 indicates the
average number of years the incumbents in each job
category had been in their current position. 142
controllers did not indicate their current position. The air traffic
controller participant sample included journeyman
controllers, developmental controllers, staff and supervisors,
as well as individuals holding several “other” positions.
These “other” positions included jobs described as Traffic
Management Coordinator.

Overall,  the  participants  indicated  they  had  spent  an 
average  of  4.15  years  in  their  previous  position.  These 
positions  included  time  as  journeyman  controller,  devel¬ 
opmental  controller,  staff,  supervisor  or  other  position. 
Those  responding  “Other”  included  cooperative  educa¬ 
tion  students,  Academy  instructors,  and  former  Air 
Force  air  traffic  controllers. 

One  goal  of  the  biographical  information  form  was  to 
get  a  clear  picture  of  the  range  and  length  of  experience 
of  the  participants  in  the  study.  To  this  end  they  were 
asked  the  number  of  years  and  months  as  FPL,  staff,  or 
supervisor  in  their  current  facility  and  in  any  facility.  The 
results are summarized in Table 5.4.8. Few of the
respondents had been in a staff or supervisory capacity for more
than a few months. Half of the respondents had never
acted  in  a  staff  position  and  almost  two-thirds  had  never 
held  a  supervisory  position.  The  amount  of  staff  experi¬ 
ence  ranged  from  0  to  10  years,  with  97.6%  of  the 
participants  having  less  than  four  years  of  experience. 
The  findings  are  similar  for  supervisory  positions;  99% 
of  the  respondents  had  seven  or  fewer  years  of  experience. 
This  indicates  that  our  controller  sample  was  indeed 
largely  composed  of  individuals  with  current  or  previous 
controller  experience. 

Also  of  interest  was  the  amount  of  time  the  incum¬ 
bents  (both  controllers  and  supervisors)  spent  actually 
controlling  air  traffic.  Respondents  were  asked  how  they 
had  spent  their  work  time  over  the  past  six  months  and 
then  to  indicate  the  percentage  of  their  work  time  they 
spent  controlling  traffic  (i.e.,  “plugged-in  time”)  and  the 
percentage  they  spent  in  other  job-related  activities  (e.g., 
crew  briefings,  CIC  duties,  staff  work,  supervisory  du¬ 
ties).  The  respondents  indicated  that  they  spent  an 
average  of  72.41%  of  their  time  controlling  traffic  and 
23.33%  of  their  time  on  other  activities. 

PSEUDO-APPLICANT  SAMPLE 

A total of 518 individuals served as pseudo-applicants
in  the  validation  study;  258  individuals  from  Keesler  Air 
Force  Base  and  256  civilians  took  part  in  the  study.  The 
racial  and  gender  breakdown  of  these  samples  is  shown 
in  Table  5.4.9. 


COMPUTER  USE  AND  EXPERIENCE 
QUESTIONNAIRE 

To determine whether participants’ familiarity with
computers could influence their scores on several of the tests in
the predictor battery, a measure of computer familiarity
and skill was included as part of the background items.
The Computer Use and Experience (CUE) Scale,
developed by Potosky and Bobko (1997), consists of twelve
5-point Likert-type items (1 = Strongly Disagree, 2 =
Disagree, 3 = Neither Agree nor Disagree, 4 = Agree, 5 =
Strongly  Agree),  which  asked  participants  to  rate  their 
knowledge  of  various  uses  for  computers  and  the  extent 
to  which  they  used  computers  for  various  reasons.  In 
addition,  5  more  items  were  written  to  ask  participants 
about  actual  use  of  the  computer  for  such  purposes  as 
playing  games,  word  processing  and  using  e-mail.  The 
resulting  17-item  instrument  is  referred  to  in  this  report 
as  the  CUE-Plus. 

Item  Statistics 

The means and standard deviations for each item are
presented in Table 5.4.10. The information reported in
the table includes both the Air Traffic Controller
participants and the pseudo-applicants. Overall, the
respondents show familiarity with computers and use them to
different degrees. Given the age range of our sample, this
is to be expected. They are fairly
familiar with the day-to-day uses of computers, such as
doing word processing or sending e-mail. Table 5.4.11
shows the item means and standard deviations for each
sample, breaking out the civilian and military
pseudo-applicant samples and the controller participants. The
means for the samples appear to be fairly similar. Table
5.4.12 shows the inter-item correlations of the CUE-Plus
items. All the items were significantly correlated
with each other.

Reliability of CUE-Plus

Using data from 1,541 respondents, the original 12-item
CUE Scale yielded a reliability coefficient (alpha) of
.92. The scale mean was 36.58 (SD = 11.34). The CUE-Plus,
with 17 items and 1,533 respondents, had a reliability
coefficient (alpha) of .94. The scale mean was
51.47 (SD = 16.11). Given the high intercorrelation
between the items, the high reliability is not surprising.
The item-total statistics are shown in Table 5.4.13. There
is a high degree of redundancy among the items. The
reliability coefficients for the samples are as follows:
controllers, .93; civilian pseudo-applicants, .91; and military
pseudo-applicants, .93, indicating that there were no large
differences between sub-groups in responding to the
CUE-Plus items.
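
For readers unfamiliar with the statistic, the following sketch computes coefficient alpha from item-level responses, assuming the reported coefficient is the usual Cronbach's alpha, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)$, where $k$ is the number of items, $s_i^2$ the variance of item $i$, and $s_T^2$ the variance of the total score. The function name and the toy data are illustrative and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents' item scores.

    `responses` is a list of rows, one per respondent, each holding the
    same number of item scores (e.g., 1-5 Likert ratings).
    """
    k = len(responses[0])
    item_columns = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Toy data: three items that rise and fall together give alpha near 1.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [5, 5, 5]]
alpha_consistent = cronbach_alpha(consistent)

# Toy data: two unrelated items give alpha near 0.
unrelated = [[1, 1], [1, 2], [2, 1], [2, 2]]
alpha_unrelated = cronbach_alpha(unrelated)
```

Because alpha rises with both the number of items and their intercorrelation, a 17-item scale with uniformly correlated items would be expected to yield a high value.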

Factor  Analysis 

Principal components analysis indicated that CUE-Plus
had two factors, but examination of the second
factor showed that it made no logical sense. Varimax and
oblique rotations yielded the same overall results. The
item “I often use a mainframe computer system” did not
load strongly on either factor, probably because few
individuals use mainframe computers. The varimax
rotation showed an inter-factor correlation of .75. Table
5.4.14 shows the eigenvalues and percentages of variance
accounted for by the factors. The eigenvalues and
variance accounted for by the two-factor solution are shown
in Table 5.4.15. The first factor accounts for over half of
the variance in the responses, with the second factor
accounting for only 6%. The last column in Table 5.4.16
shows the component matrix when only one factor was
specified. Taken together, the data suggest that one factor
would be the simplest explanation for the data structure.

PERFORMANCE  DIFFERENCES 

Gender  Differences 

The  overall  mean  for  the  CUE-Plus  was  51.31  (SD 
= 16.09). To see whether males performed significantly
differently from females on the CUE-Plus, difference
scores were computed for the different samples. The
difference  score  (d)  is  the  standardized  mean  difference 
between  males  and  females.  A  positive  value  indicates 
superior  performance  by  males.  The  results  are  reported 
in  Table  5.4.16.  For  all  samples,  males  scored  higher  on 
the  CUE  (i.e.,  were  more  familiar  with  or  used  comput¬ 
ers  for  a  wider  range  of  activities),  but  at  most,  these 
differences  were  only  moderate  (.04  to  .42). 
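
A standardized mean difference of this kind can be computed as sketched below. The report does not state which standard deviation was used as the denominator; the pooled within-group standard deviation used here is an assumption, and the function name and data are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(group_a, group_b):
    """Mean of group_a minus mean of group_b, in pooled-SD units.

    A positive value means group_a scored higher. The pooled SD is an
    assumed choice of denominator.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


# Toy scores: means differ by 1 point and the pooled SD is 2, so d = 0.5.
d = standardized_difference([2, 4, 6], [1, 3, 5])
```

Reversing the order of the two groups changes only the sign of d, which is why the text specifies the direction of a positive value.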

Ethnic  Differences 

Performance  differences  on  the  CUE-Plus  between 
ethnic groups were also investigated. The means,
standard deviations and difference scores (d) for each group
are presented in Table 5.4.17. The table is split out by
sample  type  (e.g.,  Controller,  Military  PA,  Civilian  PA). 
Comparisons were conducted between Caucasians and
three comparison groups: African-Americans, Hispanics,
and all non-Caucasian participants. A positive value
indicates superior performance by Caucasians; a negative
value indicates superior performance by the comparison
group. The differences were very low to moderate,
with absolute values ranging from .04 to .31.
The highest d scores were in the Military PA sample.
Caucasians scored higher than the comparison groups in
all cases except for the Civilian PA sample, in which
African-Americans scored higher than Caucasians.

Summary 

All  in  all,  these  results  show  the  CUE-Plus  to  have  very 
small  differences  for  both  gender  and  race.  To  the  extent 
that  the  instrument  predicts  scores  on  the  test  battery, 
test  differences  are  not  likely  to  be  attributable  to  com¬ 
puter  familiarity. 

RELATIONSHIP  BETWEEN  CUE-PLUS 
AND  PREDICTOR  SCORES 

Correlations 

An  argument  could  be  made  that  one’s  familiarity 
with  and  use  of  computers  could  influence  scores  on  the 
computerized  predictor  battery.  To  address  that  ques¬ 
tion,  correlations  between  the  individual  CUE-Plus 
items  and  the  CUE-Plus  total  score  with  the  AT-SAT 
predictor  scores  were  computed.  One  area  of  interest  is 
to  what  extent  computer  familiarity  will  affect  the  scores 
of  applicants.  To  better  examine  the  data  in  this  light,  the 
sample  was  separated  into  controllers  and  pseudo-appli¬ 
cants  and  separate  correlations  performed  for  the  two 
groups.  The  correlations  for  the  controller  sample  are 
shown  in  Tables  5.4.18  and  5.4.19.  Table  5.4.18  shows 
the  correlations  between  the  CUE  items  and  Applied 
Math, Angles, Air Traffic Scenarios, Analogy, Dials, and
Scan scores. Table 5.4.19 shows the correlations between
CUE-Plus and Letter Factory, Memory, Memory
Recall, Planes, Sounds and Time-Wall (TW) scores. Tables
5.4.20 and 5.4.21 contain the same information for the
pseudo-applicant  sample.  In  general,  the  CUE-Plus  scores 
were  more  highly  correlated  with  performance  on  the 
AT-SAT  battery  for  the  pseudo-applicants  than  for  the 
controllers. 
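
Computing the correlations separately for each group can be sketched as follows. The record layout (`sample`, `cue_plus`, and a score field named `am_total`) is hypothetical, and the sketch returns Pearson correlations only; it does not reproduce the significance tests reported in the tables.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def correlations_by_sample(records, score_key):
    """Correlate CUE-Plus totals with one predictor score within each sample."""
    groups = {}
    for rec in records:
        groups.setdefault(rec["sample"], []).append(rec)
    return {
        sample: pearson_r([r["cue_plus"] for r in recs],
                          [r[score_key] for r in recs])
        for sample, recs in groups.items()
    }


records = [
    {"sample": "controller", "cue_plus": 40, "am_total": 10},
    {"sample": "controller", "cue_plus": 50, "am_total": 12},
    {"sample": "controller", "cue_plus": 60, "am_total": 14},
    {"sample": "pseudo-applicant", "cue_plus": 40, "am_total": 9},
    {"sample": "pseudo-applicant", "cue_plus": 50, "am_total": 7},
    {"sample": "pseudo-applicant", "cue_plus": 60, "am_total": 5},
]
by_sample = correlations_by_sample(records, "am_total")
```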

The  CUE-Plus  total  score  was  correlated  (p  <  .05  or 
p  <  .01)  with  all  predictor  scores  with  the  exception  of 
those  for  Analogy:  Latency  and  Time-Wall:  Perceptual 
Speed  for  the  pseudo-applicants.  The  same  was  true  for 
the  controller  sample  with  regard  to  Air  Traffic  Sce¬ 
narios:  Accuracy,  Memory:  Number  Correct,  Recall: 
Number  Correct,  Planes:  Projection  and  Planes:  Time 
Sharing. Given the widespread use of computers at work
and school and the use of Internet services, this rate of
correlation is not surprising.

The Letter Factory test scores on Situational Awareness
and Planning and Thinking Ahead are highly correlated
with the individual CUE-Plus items for the
pseudo-applicants, while the controllers’ Planning and
Thinking Ahead scores were more often correlated with
the CUE-Plus items than were their Awareness scores.
One explanation for these high correlations is that the
more comfortable one is with various aspects of using a
computer, the more cognitive resources can be allocated for
planning. When the use of the computer is automatic,
more concentration can be focused on the specific task.

The  Time-Wall  perception  scores  (Time  Estimate 
Accuracy  and  Perceptual  Accuracy)  are  highly  correlated 
with  the  individual  CUE  items  for  the  pseudo-appli¬ 
cants  and  correlated  to  a  lesser  extent  for  the  controllers. 
The  reverse  is  true  for  the  Perceptual  Speed  variable:  the 
controller  scores  are  almost  all  highly  correlated  with 
CUE-Plus  items,  while  only  two  of  the  items  are  corre¬ 
lated  for  the  pseudo-applicants.  The  Time-Wall  test  will 
not  be  included  in  the  final  test  battery,  so  this  is  not  a 
consideration  as  far  as  fairness  is  concerned. 

Using  a  mainframe  computer  correlated  with  only 
one  of  the  test  battery  scores  for  the  controller  sample, 
but  correlated  highly  with  several  test  scores  for  the 
pseudo-applicants.  The  fact  that  controllers  use  main¬ 
frames  in  their  work  probably  had  an  effect  on  their 
correlations. 

Regression  Analyses 

Regression  analyses  were  conducted  to  investigate  the 
extent  to  which  the  CUE-Plus  and  four  demographic 
variables  predict  test  performance.  The  dependent  vari¬ 
ables  predicted  were  the  measures  that  are  used  in  the  test 
battery. Dummy variables for race were calculated, one
to compare Caucasians and African-Americans, one to
compare Hispanics to Caucasians, and the third to
compare all minorities to Caucasians. Those identified
as Caucasian were coded as 1; members of the comparison
groups were coded as 0. A total of 1,497 cases were analyzed.
Thus, five variables were used in each regression analysis:
one of the three “race” variables, education, age, gender
and score on CUE-Plus.
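
The dummy coding can be sketched as follows. The variable names and category labels are hypothetical. The sketch also assumes that participants who belong to neither group in a given comparison (for example, Hispanic participants in the Caucasian/African-American comparison) are excluded from that comparison; the report does not state this explicitly.

```python
def race_dummies(race):
    """Return the three comparison codes for one participant.

    1 = Caucasian, 0 = member of the comparison group, None = not part of
    that comparison (an assumed treatment) or race not reported.
    """
    if race is None:
        return {"cauc_vs_afam": None, "cauc_vs_hisp": None,
                "cauc_vs_minority": None}
    if race == "Caucasian":
        return {"cauc_vs_afam": 1, "cauc_vs_hisp": 1, "cauc_vs_minority": 1}
    return {
        "cauc_vs_afam": 0 if race == "African-American" else None,
        "cauc_vs_hisp": 0 if race == "Hispanic" else None,
        "cauc_vs_minority": 0,  # every non-Caucasian participant
    }


print(race_dummies("Hispanic"))
# {'cauc_vs_afam': None, 'cauc_vs_hisp': 0, 'cauc_vs_minority': 0}
```

Each regression would then use one of these three codes together with education, age, gender, and the CUE-Plus score.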

Applied  Math 

The  variables  described  above  were  entered  as  predic¬ 
tors  for  the  total  number  of  items  correct.  For  all  three 
comparisons,  all  variables  were  included  in  the  final 
model.  That  model  accounted  for  approximately  20%  of 
the variance for all three comparisons. Gender was the
best predictor of performance. Negative b weights for
gender  indicate  that  males  performed  better  than  fe¬ 
males.  The  positive  weights  for  age  indicate  that  the  older 
the  individual,  the  higher  their  score  on  the  Applied 
Math  test.  Education  and  CUE-Plus  score  were  also 
positively  weighted,  indicating  that  the  more  education 
one  received  and  the  more  familiar  one  is  with  comput¬ 
ers,  the  better  one  is  likely  to  do  on  the  Applied  Math  test. 
Caucasian  participants  scored  higher  than  did  their 
comparison  groups.  The  statistics  for  each  variable  en¬ 
tered  are  shown  in  Table  5.4.22. 

Angles  Test 

The  same  general  pattern  of  results  holds  true  for  the 
Angles  test.  Table  5.4.23  shows  the  statistics  for  each 
variable.  Age  was  not  a  predictor  of  performance  for  this 
test  in  any  of  the  comparisons.  The  other  variables  were 
predictive  for  the  Caucasian/African-American  and  the 
Caucasian/Minority  models.  Race  was  not  a  predictor 
for  the  Caucasian/Hispanic  model.  In  all  cases,  females 
performed  less  well  than  males.  Amount  of  education 
and  CUE-Plus  were  positive  indicators  of  performance. 
The  predictor  sets  accounted  for  about  10%  of  the 
variance in Angles test scores; the CUE-Plus score
contributed  little  to  explaining  the  variance  in  scores. 

Air  Traffic  Scenarios 

The  predictor  variables  accounted  for  between  15% 
and  20%  of  the  variance  in  the  Efficiency  scores  (see 
Table  5.4.24),  but  only  about  3%  for  Safety  (Table 
5.4.25)  and  7%  for  Procedural  Accuracy  (Table  5.4.26). 
CUE-Plus  scores  were  predictive  of  performance  for  all 
three  variables,  but  not  particularly  strongly.  Age  was  a 
positive  predictor  of  performance  for  only  the  Proce¬ 
dural  Accuracy  variable.  Gender  was  a  predictor  for 
Efficiency  in  all  three  models,  but  not  consistently  for 
the  other  two  variables.  Education  predicted  only  Proce¬ 
dural  Accuracy.  Race  was  not  a  predictor  for  the  Cauca¬ 
sian/Hispanic  models,  although  it  was  for  the  other 
models. 

Analogy  Test 

Age  was  a  fairly  consistent  predictor  for  the  Informa¬ 
tion  Processing  (see  Table  5.4.27)  and  Reasoning  vari¬ 
ables  (see  Table  5.4.28),  although  it  did  not  predict 
Reasoning  performance  in  the  Caucasian/Minority  and 
Caucasian/African-American  equations.  Education  was 
a  negative  predictor  for  Information  Processing,  but  was 
positively related to Reasoning. CUE-Plus was a predictor
for Reasoning, but not for Information Processing.
Together, the independent variables accounted for about
11% of the variance in the Information Processing scores
and about 16% of the variance in the Reasoning scores.

Dials  Test 

The  number  of  items  correct  on  the  Dials  test  was 
predicted  by  gender,  education,  race  and  CUE-Plus. 
Table  5.4.29  shows  the  statistics  associated  with  the 
analysis.  Males  are  predicted  to  score  higher  than  fe¬ 
males;  those  with  higher  education  are  predicted  to 
perform  better  on  the  test  than  those  with  less  education. 
Race  was  positively  related  with  Dials  scores,  indicating 
that  Caucasians  tended  to  score  higher  than  their  com¬ 
parison  groups.  CUE-Plus  was  a  significant,  but  weak 
predictor  for  the  Caucasian/Minority  and  Caucasian/ 
African-American  models.  It  did  not  predict  perfor¬ 
mance  in  the  Caucasian/Hispanic  model.  The  four 
variables  accounted  for  between  8%  and  10%  of  the 
variance  in  Dials  test  performance. 

Letter  Factory  Test 

The  Letter  Factory  test  had  two  scores  of  interest: 
Situational  Awareness  and  Planning  and  Thinking  Ahead. 
Age  and  gender  did  not  predict  for  either  score.  Race  and 
CUE-Plus  score  were  predictors  for  both  variables;  edu¬ 
cation  was  a  predictor  for  Situational  Awareness.  These 
variables  accounted  for  between  7%  and  12%  of  the 
variance  in  the  Situational  Awareness  score  (see  Table 
5.4.30) and 11% to 15% of the variance in the Planning
and  Thinking  Ahead  score  (see  Table  5.4.31). 

Scan  Test 

The  variables  in  the  regression  equation  accounted  for 
only  1%  to  3%  of  the  variance  in  the  Scan  score  (see 
Table  5.4.32).  Education  was  a  positive  predictor  for  all 
three  equations.  Race  was  a  predictor  for  the  Caucasian/ 
African-American  model.  CUE-Plus  score  positively  pre¬ 
dicted  performance  in  the  Caucasian/Hispanic  equation. 

Summary 

The  question  of  interest  in  this  section  has  been  the 
extent  to  which  computer  familiarity,  as  measured  by 
CUE-Plus,  influences  performance  on  the  AT-SAT  test 
battery.  The  correlation  matrices  indicated  a  low  to 
moderate  level  of  relationship  between  CUE-Plus  and 
many  of  the  variables  in  the  pilot  test  battery  for  the 


controller  sample.  The  correlations  were  higher  for  the 
pseudo-applicant  sample.  To  further  investigate  these 
relationships,  regression  analyses  were  conducted  to  see 
how well CUE-Plus and other relevant demographic
variables  predicted  performance  on  the  variables  that 
were  used  in  the  V  1.0  test  battery. 

The  results  showed  that  overall,  the  demographic 
variables  were  not  strong  predictors  of  test  performance. 
The  variables  accounted  for  relatively  little  of  the  vari¬ 
ance  in  the  test  scores.  CUE-Plus  was  identified  as  a 
predictor  for  nine  of  the  eleven  test  scores.  However, 
even  for  the  scores  where  CUE-Plus  was  the  strongest 
predictor  of  the  variables  entered,  it  accounted  for  no 
more  than  8%  of  the  variance  in  the  score.  In  most  of  the 
scores,  the  effect,  although  statistically  significant,  was 
realistically  negligible. 

SUMMARY 

This  chapter  described  the  participants  in  the  AT- 
SAT  validation  study.  The  participants  represented  both 
genders  and  the  U.S.  ethnicities  likely  to  form  the  pool 
of  applicants  for  the  Air  Traffic  Controller  position. 

In  addition  to  describing  the  demographic  character¬ 
istics  of  the  sample  on  which  the  test  battery  was  vali¬ 
dated,  this  chapter  also  described  a  measure  of  computer 
familiarity,  CUE.  CUE  was  developed  by  Potosky  and 
Bobko  (1997)  and  revised  for  this  effort  (CUE-Plus). 
The  CUE-Plus  is  a  highly  reliable  scale  (alpha  =  .92); 
factor analysis indicated that there was only one interpretable
factor. Analysis of the effect of gender on CUE-Plus
scores showed moderate differences for the controller
sample and none for the pseudo-applicant sample; males
scored higher on the CUE-Plus than did females. There
were  also  small  to  moderate  differences  in  CUE-Plus  for 
ethnicity.  The  strongest  differences  were  found  in  the 
military  pseudo-applicant  sample. 

CUE-Plus  items  showed  a  moderate  to  high  correla¬ 
tion  with  the  variables  assessed  in  the  validation  study. 
The  CUE-Plus  was  also  shown  to  be  a  fairly  weak  but 
consistent  predictor  of  performance  on  the  variables  that 
were included in the V 1.0 test battery. Although there were
some performance differences attributable to gender,
race, and computer experience, none of these were extremely
strong. The effects of computer skill would be
washed  out  by  recruiting  individuals  who  have  strong 
computer  skills. 
CHAPTER 5.5


Predictor-Criterion  Analyses 
Gordon  Waugh,  HumRRO 


Overview  of  the  Predictor-Criterion  Validity 
Analyses 

The  main  purpose  of  the  validity  analyses  was  to 
determine  the  relationship  of  AT-SAT  test  scores  to  air 
traffic  controller  job  performance.  Additional  goals  of 
the  project  included  selecting  tests  for  the  final  AT-SAT 
battery, identifying a reasonable cut score, and
developing an approach to combine the various AT-SAT
scores into a single final score. Several steps were
performed  during  the  validity  analyses: 

•  Select  the  criteria  for  validation  analyses 

•  Compute  zero-order  validities  for  each  predictor  score 
and  test 

•  Compute  incremental  validities  for  each  test 

•  Determine  the  best  combination  of  tests  to  include  in 
the  final  battery 

•  Determine  how  to  weight  the  test  scores  and  compute 
the  predictor  composite  score 

•  Compute  the  validity  coefficients  for  the  predictor 
composite 

•  Correct  the  validity  coefficient  for  statistical  artifacts 

Many  criterion  scores  were  computed  during  the 
project.  It  was  impractical  to  use  all  of  these  scores  during 
the  validation  analyses.  Therefore,  a  few  of  these  scores 
had  to  be  selected  to  use  for  validation  purposes.  The 
three  types  of  criterion  measures  used  in  the  project  were 
the  CBPM  (Computer-Based  Performance  Measure), 
the  Behavior  Summary  Scales  (which  are  also  called 
Ratings  in  this  chapter),  and  the  HiFi  (High  Fidelity 
Performance Measure). The development, dimensionality,
and construct validity of the criteria are discussed at
length  in  Chapter  4  of  this  report. 

The  CBPM  was  a  medium  fidelity  simulation.  A 
computer  displayed  a  simulated  air  space  sector  while  the 
examinee  answered  questions  based  on  the  air  traffic 
scenario  shown.  The  Behavior  Summary  Scales  were 
performance  ratings  completed  by  the  examinee’s  peers 
and  supervisors.  The  HiFi  scores  were  based  upon 
observers’  comprehensive  ratings  of  the  examinee’s 
two-day  performance  on  a  high-fidelity  air  traffic 
control  simulator. 


Based  on  the  analyses  of  the  dimensions  underlying 
the  criteria,  it  was  concluded  that  the  criteria  space  could 
be  summarized  with  four  scores:  (a)  the  CBPM  score,  (b) 
a  single  composite  score  of  the  10  Behavior  Summary 
Scales  (computed  as  the  mean  of  the  10  scales),  (c)  HiFi 
1:  Core  Technical  score  (a  composite  of  several  scores) 
and  (d)  HiFi  2:  Controlling  Traffic  Safely  and  Effi¬ 
ciently  (a  composite  of  several  scores).  The  small  sample 
size  for  the  HiFi  measures  precluded  their  use  in  the 
selection  of  a  final  predictor  battery  and  computation  of 
the  predictor  composite.  They  were  used,  however,  in 
some  of  the  final  validity  analyses  as  a  comparison 
standard  for  the  other  criteria. 

A  single,  composite  criterion  was  computed  using  the 
CBPM  score  and  the  composite  Ratings  score.  Thus,  the 
following  three  criteria  were  used  for  the  validity  analy¬ 
ses:  (a)  the  CBPM  score,  (b)  the  composite  Ratings 
score,  and  (c)  the  composite  criterion  score. 

Zero-Order  Validities 

It  is  important  to  know  how  closely  each  predictor 
score  was  related  to  job  performance.  Only  the  predictor 
scores  related  to  the  criteria  are  useful  for  predicting  job 
performance.  In  addition,  it  is  often  wise  to  exclude  tests 
from  a  test  battery  if  their  scores  are  only  slightly  related 
to  the  criteria.  A  shorter  test  battery  is  cheaper  to  develop, 
maintain,  and  administer  and  is  more  enjoyable  for  the 
examinees. 

Therefore,  the  zero-order  correlation  was  computed 
between  each  predictor  score  and  each  of  the  three 
criteria  (CBPM,  Ratings,  and  Composite).  Because  some 
tests  produced  more  than  one  score,  the  multiple  corre¬ 
lation  of  each  criterion  with  the  set  of  scores  for  each 
multi-measure  test  was  also  computed.  This  allowed  the 
assessment  of  the  relationship  between  each  test,  as  a 
whole,  and  the  criteria.  These  correlations  are  shown  in 
Table  5.5.1  below. 

Ideally,  we  would  like  to  know  the  correlation  be¬ 
tween  the  predictors  and  the  criteria  among  job  appli¬ 
cants.  In  this  study,  however,  we  did  not  have  criteria 
information  for  the  applicants  (we  did  not  actually  use 
real applicants but rather pseudo-applicants). That would
require a predictive study design. The current study uses
a concurrent design: We computed the predictor-criteria
correlations  using  current  controllers.  Correlations  are 
affected  by  the  amount  of  variation  in  the  scores.  Scales 
with  little  variation  among  the  scores  tend  to  have  low 
correlations  with  other  scales.  In  this  study,  the  variation 
in  the  predictor  scores  was  much  greater  among  the 
pseudo-applicants than among the controllers. Therefore,
we would expect the correlations to be higher
within the pseudo-applicant sample. A statistical formula,
called correction for range restriction, was used to
estimate  what  these  correlations  would  be  among  the 
pseudo-applicants.  The  formula  requires  three  values: 
(a)  the  uncorrected  correlation,  (b)  the  predictor's  stan¬ 
dard  deviation  for  the  pseudo-applicant  sample,  and  (c)  the 
predictor’s  standard  deviation  for  the  controller  sample. 
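
The text does not say which range-restriction formula was applied. A common choice that uses exactly the three inputs listed is Thorndike's Case II correction for direct restriction on the predictor. The sketch below assumes that formula; the function name and the sample values are hypothetical and are not taken from the study.

```python
import math

def correct_for_range_restriction(r, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction (assumed here, not confirmed by the report).

    r               -- uncorrected correlation in the restricted (controller) sample
    sd_unrestricted -- predictor SD in the unrestricted (pseudo-applicant) sample
    sd_restricted   -- predictor SD in the restricted (controller) sample
    """
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1.0 + r * r * (u * u - 1.0))

# Hypothetical values for illustration only.
corrected = correct_for_range_restriction(0.30, sd_unrestricted=10.0, sd_restricted=6.0)
print(round(corrected, 3))
```

Under this formula, equal standard deviations leave the correlation unchanged, and a pseudo-applicant standard deviation smaller than the controller one produces a downward correction.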

Table 5.5.1 shows both the corrected and uncorrected
correlations.  The  amount  of  correction  varies  among  the 
predictors  because  the  ratio  of  the  pseudo-applicant  vs. 
controller  standard  deviations  also  varies.  The  greatest 
correction  occurs  for  predictors  which  exhibit  the  great¬ 
est  differences  in  standard  deviation  between  the  two 
samples  (e.g.,  Applied  Math).  The  least  correction  (or 
even  downward  correction)  occurs  for  predictors  whose 
standard  deviation  differs  little  between  the  two  samples 
(e.g.,  the  EQ  scales). 

Table  5.5.1  shows  that  most  of  the  tests  exhibit 
moderate  to  high  correlations  with  the  CBPM  and  low 
to moderate correlations with the Ratings. Some scales,
however, had no significant (p < .05) correlations with
the criteria: the Information Processing Latency scale
from the Analogies test and 2 of the 14 scales from the
Experiences Questionnaire (Tolerance for High Intensity
and Taking Charge). In addition, these two EQ scales,
along with the EQ scale Working Cooperatively, correlated
negatively with the CBPM and composite criteria.
Thus,  it  is  doubtful  that  these  scores  would  be  very  useful 
in  predicting  job  performance.  Analyses  of  their  incre¬ 
mental  validities,  discussed  below,  confirmed  that 
these  scores  do  not  significantly  improve  the  predic¬ 
tion  of  the  criteria. 

The  EQ  (Experiences  Questionnaire)  is  a  self-report 
personality  inventory.  It  is  not  surprising,  then,  that  its 
scales  do  not  perform  as  well  as  the  other  tests — which 
are  all  cognitive  measures — in  predicting  the  CBPM 
which  is  largely  a  cognitive  measure.  The  cognitive  tests 
were  generally  on  a  par  with  the  EQ  in  predicting  the 
Ratings  criterion.  A  notable  exception  was  the  Applied 
Math  test,  which  greatly  outperformed  all  other  tests  in 
predicting  either  the  CBPM  or  the  Ratings.  Note  that  the 


Ratings  criterion  is  a  unit-weighted  composite  of  the  10 
behavior  summary  scales  completed  by  supervisors.  The 
EQ  correlated  quite  highly  with  a  number  of  these 
behavior  summary  scales,  e.g.,  the  four  scales  making  up 
the  Technical  Effort  factor,  and  the  single  scale  in  the 
teamwork  factor,  but  not  very  highly  with  the  composite 
Ratings  criterion. 

Composure  and  Concentration  are  the  only  EQ  scales 
that  correlate  above  .08  with  the  CBPM,  whereas  eight 
scales  correlate  this  highly  with  the  Ratings.  This  is  not 
surprising because both personality measures and performance
ratings incorporate non-cognitive performance
factors such as motivation. The moderate size of the multiple
correlation of the EQ with the CBPM of .16 is
misleadingly  high  because  three  of  the  EQ  scales  corre¬ 
late  negatively  with  the  CBPM.  The  size  of  a  multiple 
correlation  is  usually  just  as  large  when  some  of  the 
correlations  are  negative  as  when  all  are  positive.  Scales 
that  correlate  negatively  with  the  criterion,  however, 
should  not  be  used  in  a  test  battery.  Otherwise,  examin¬ 
ees  scoring  higher  on  these  scales  would  get  lower  scores 
on  the  battery.  When  the  three  scales  that  correlate 
negatively  with  the  CBPM  are  excluded,  the  EQ  has  a 
multiple correlation of only .10 (corrected for shrinkage)
with  the  CBPM. 

Incremental  Validities 

At  this  point,  all  the  scores — except  for  the  Informa¬ 
tion  Processing  score  from  the  Analogies  test  and  7  of  the 
14  scores  from  the  Experiences  Questionnaire — have 
demonstrated  that  they  are  related  to  the  criteria.  The 
next  step  was  to  determine  which  scales  have  a  unique 
contribution  in  predicting  the  criteria.  That  is,  some 
scales  might  not  add  anything  to  the  prediction  because 
they  are  predicting  the  same  aspects  of  the  criteria  as 
some  other  scales. 

If  two  tests  predict  the  same  aspects  of  the  criteria  then 
they  are  redundant.  Only  one  of  the  tests  is  needed.  The 
amount  of  the  unique  contribution  that  a  test  makes 
toward  predicting  a  criterion  is  called  incremental  valid¬ 
ity.  More  precisely,  the  incremental  validity  of  a  test  is  the 
increase  in  the  validity  of  the  test  battery  (i.e.,  multiple 
correlation  of  the  criterion  with  the  predictors)  when 
that  test  is  added  to  a  battery. 
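
As a small worked illustration of this definition, the sketch below treats the existing battery as a single predictor and adds one new test, using the standard closed-form expression for the multiple correlation with two predictors. The function names and all correlation values are hypothetical.

```python
import math

def multiple_r_two_predictors(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2 -- each predictor's correlation with the criterion
    r12    -- correlation between the two predictors
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    return math.sqrt(r_squared)

def incremental_validity(r_existing, r_new, r_between):
    """Gain in battery validity when the new test joins the existing predictor."""
    return multiple_r_two_predictors(r_existing, r_new, r_between) - r_existing

# A new test that is uncorrelated with the existing predictor adds validity.
print(round(incremental_validity(0.40, 0.30, 0.00), 3))

# A new test whose validity is fully explained by its overlap with the
# existing predictor (0.30 = 0.40 * 0.75) is redundant and adds nothing.
print(round(incremental_validity(0.40, 0.30, 0.75), 3))
```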

Table  5.5.2  shows  the  incremental  validities  for  each 
test  and  scale.  There  are  two  values  for  most  tests.  The 
first  value  shows  the  incremental  validity  when  the  test 
is  added  to  a  battery  that  contains  all  the  other  tests;  the 
other  value  shows  the  incremental  validity  when  the  test 
is added to only the tests in the final AT-SAT battery. In
addition,  incremental  validities  for  the  final  version  of 
the  EQ  test  (in  which  three  of  the  original  EQ  scales  were 
dropped)  are  shown. 

Three  tests  have  a  substantial  unique  contribution 
to the prediction of the criteria. Each has an incremental
validity greater than .10 (corrected for shrinkage
but not for range restriction). They are, in order of
decreasing  incremental  validity,  Applied  Math,  EQ, 
and  Air  Traffic  Scenarios. 

Determination  of  Scale  Weights  for  the  Test  Battery 

The  full  AT-SAT  battery  would  require  more  than  a 
day  of  testing  time.  Thus,  it  was  desired  to  drop  some  of 
the  tests  for  this  reason  alone.  Therefore,  several  tests 
were  excluded  from  the  final  test  battery  taking  into 
consideration  the  following  goals: 

1.  Maintain  high  concurrent  validity. 

2.  Limit  the  test  administration  time  to  a  reasonable 
amount. 

3.  Reduce  differences  between  gender/racial  group 
means. 

4. Avoid significant differences in prediction equations
(i.e.,  regression  slopes  or  intercepts)  favoring  males  or 
whites  (i.e.,  no  unfairness). 

5.  Retain  enough  tests  to  allow  the  possibility  of  in¬ 
creasing  the  predictive  validity  as  data  becomes  available 
in  the  future. 

There  are  typically  three  main  types  of  weighting 
schemes:  regression  weighting,  unit  weighting,  and  va¬ 
lidity  weighting.  In  regression  weighting,  the  scales  are 
weighted  to  maximize  the  validity  of  the  predictor  com¬ 
posite  in  the  sample  of  examinees.  The  main  problem 
with  this  scheme  is  that  the  validity  drops  when  the 
predictor  weights  are  used  in  the  population.  Unit  weight¬ 
ing  gives  equal  weight  to  each  scale  or  test.  It  tends  to 
sacrifice  some  sample  validity,  but  its  validity  does  not 
typically  drop  in  the  population  because  the  weights  are 
chosen  independent  of  the  sample.  Validity  weighting 
assigns  each  scale’s  simple  validity  as  its  weight.  This 
scheme  is  a  compromise  between  the  two  methods. 
Validity  weights  do  almost  as  well  as  regression  weights 
in  the  sample.  More  importantly,  validity  weights  are  less 
sensitive  to  differences  in  samples  than  regression  weights. 
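
To make the comparison between unit and validity weighting concrete, the sketch below computes the validity of a weighted composite from the scales' validities and intercorrelations, using the standard formula for the correlation between a weighted sum of standardized variables and a criterion. All numbers are invented for illustration and are not AT-SAT results.

```python
import math

def composite_validity(weights, validities, intercorrelations):
    """Correlation of a weighted sum of standardized scales with the criterion."""
    k = len(weights)
    covariance_with_criterion = sum(w * r for w, r in zip(weights, validities))
    composite_variance = sum(
        weights[i] * weights[j] * intercorrelations[i][j]
        for i in range(k)
        for j in range(k)
    )
    return covariance_with_criterion / math.sqrt(composite_variance)

# Hypothetical scale validities and scale intercorrelations.
validities = [0.40, 0.20, 0.10]
intercorrelations = [
    [1.0, 0.3, 0.3],
    [0.3, 1.0, 0.3],
    [0.3, 0.3, 1.0],
]

unit_validity = composite_validity([1.0, 1.0, 1.0], validities, intercorrelations)
validity_weighted = composite_validity(validities, validities, intercorrelations)
print(round(unit_validity, 3), round(validity_weighted, 3))
```

Because this calculation takes the correlations as given, it shows only the within-sample comparison; it does not model how each scheme holds up in a new sample.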

The  large  numbers  of  scales  and  parameters  to  con¬ 
sider  for  each  scale  made  it  difficult  to  subjectively  decide 
which  tests  to  drop.  For  each  scale,  ten  parameters  were 
relevant  to  this  decision.  To  aid  in  this  decision,  a 


computer  program  was  written  (using  Visual  Basic) 
which  essentially  considered  all  these  parameters  simul¬ 
taneously.  In  choosing  the  set  of  optimal  scale  weights, 
the  program  considered  the  following  sets  of  parameters 
of  the  resulting  predictor  composite:  overall  validity, 
differences  in  group  means,  differences  in  the  groups’ 
regression  slopes,  and  differences  in  the  groups’  inter¬ 
cepts.  There  were  three  parameters  for  each  type  of  group 
difference:  females  vs.  males,  blacks  vs.  whites,  Hispanics 
vs.  whites.  One  final  feature  of  the  program  is  that  it 
would  not  allow  negative  weights.  That  is,  if  a  scale’s 
computed  weight  was  such  that  a  high  score  on  the  scale 
would  lower  the  score  on  the  overall  score  then  the  scale’s 
weight  was  set  to  zero. 

Several  computer  runs  were  made.  For  each  run,  the 
relative importance of the parameters was varied. The
goal  was  to  maximize  the  overall  validity  while  minimiz¬ 
ing  group  differences.  In  the  end,  the  group  difference 
with  the  greatest  effect  on  the  overall  validity  was  the 
black  vs.  white  group  mean  on  the  composite  predictor. 
Thus,  the  ultimate  goal  became  to  reduce  the  differences 
between  the  black  and  white  means  without  reducing  the 
maximum  overall  validity  by  a  statistically  significant  amount. 

There  were  only  nine  scales  remaining  with  non-zero 
weights  after  this  process.  This  low  number  of  scales  was 
undesirable.  It  is  possible  that  some  of  the  excluded  tests 
might  perform  better  in  a  future  predictive  validity  study 
than  in  the  concurrent  study.  If  these  tests  are  excluded 
from  the  battery,  then  there  will  be  no  data  on  them  for 
the  predictive  validity  study.  Another  limitation  of  this 
technique  is  that  the  weights  will  change,  possibly  sub¬ 
stantially,  if  applied  to  another  sample. 

Therefore,  a  combination  of  the  validity  weighting 
and  optimal  weighting  schemes  was  used.  For  each  scale, 
the  weight  used  was  the  mean  of  the  optimal  and  validity 
weights.  A  description  of  the  computation  of  the  validity 
and  optimal  weights  follows. 

The  computation  of  the  validity  weights  for  a  single¬ 
scale  test  was  straightforward.  It  was  merely  the  correla¬ 
tion,  corrected  for  range  restriction,  of  the  scale  with  the 
composite  criterion.  The  computation  for  the  multi¬ 
scale  tests  was  somewhat  more  complex.  First,  the  mul¬ 
tiple  correlation,  corrected  for  range  restriction,  of  the 
test  with  the  composite  criterion  was  computed.  This 
represents the contribution of the test to the composite
predictor.  Then,  the  correlations  of  each  of  the  test’s 
scales  with  the  composite  criterion,  corrected  for  range 
restriction,  were  computed.  The  validity  weights  of  the 
scales  were  computed  according  to  the  following  formula: 
$$w_i = R \cdot \frac{r_i}{\sum_{j=1}^{k} r_j} \qquad \text{[Equation 5.5.1]}$$

where $w_i$ = validity weight of scale $i$, $r_i$ = correlation of the
predictor scale with the criterion, $R$ = multiple correlation
of the test with the criterion, and $r_j$ = the correlation with the
criterion of scale $j$ of the $k$ scales within the test. All
correlations were corrected for range restriction.
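
The sketch below reads Equation 5.5.1 as splitting the test's multiple correlation R among its scales in proportion to each scale's own validity, that is, w_i = R * r_i / (sum of r_j). That reading is an interpretation of the definitions given here, and the function name and input values are hypothetical.

```python
def scale_validity_weights(test_multiple_r, scale_validities):
    """Split a test's multiple correlation across its scales.

    Each scale receives a share of test_multiple_r proportional to its own
    correlation with the criterion (assumed reading of Equation 5.5.1).
    """
    total = sum(scale_validities)
    return [test_multiple_r * r / total for r in scale_validities]

# Hypothetical three-scale test with a multiple correlation of .30.
weights = scale_validity_weights(0.30, [0.20, 0.10, 0.10])
print([round(w, 3) for w in weights])
```

Under this reading the scale weights of a test sum to the test's multiple correlation, which matches the statement that R represents the test's contribution to the composite predictor.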

The  validity  weights  and  optimal  weights  had  to  be 
put  on  a  common  metric  before  they  could  be  combined. 
Each  validity  weight  was  multiplied  by  a  constant  such 
that  all  the  weights  summed  to  1.00.  Similarly,  each 
optimal  weight  was  multiplied  by  a  constant  such  that  all 
the weights summed to 1.00. Each predictor’s combined
weight  was  then  computed  as  the  mean  of  its  rescaled 
optimal  and  validity  weights.  Finally,  the  combined 
weight  was  rescaled  in  the  same  manner  as  the  validity 
and  optimal  weights.  That  is,  each  combined  weight  was 
multiplied  by  a  constant  such  that  all  the  weights  summed 
to  1.00.  This  rescaling  was  done  to  aid  interpretation  of 
the  weights.  Each  weight  represents  a  predictor’s  relative 
contribution,  expressed  as  a  proportion,  to  the  predictor 
composite. 
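
The rescaling and averaging steps described above can be sketched as follows. The function names and the example weights are hypothetical; only the sequence of steps comes from the text.

```python
def rescale_to_unit_sum(weights):
    """Multiply every weight by one constant so the weights sum to 1.00."""
    total = sum(weights)
    return [w / total for w in weights]

def combined_weights(validity_weights, optimal_weights):
    """Mean of the rescaled validity and optimal weights, rescaled again."""
    rescaled_validity = rescale_to_unit_sum(validity_weights)
    rescaled_optimal = rescale_to_unit_sum(optimal_weights)
    means = [(v + o) / 2.0 for v, o in zip(rescaled_validity, rescaled_optimal)]
    return rescale_to_unit_sum(means)

# Hypothetical weights for three predictors; the second has an optimal weight of zero.
combined = combined_weights([0.30, 0.20, 0.10], [2.0, 0.0, 2.0])
print([round(w, 3) for w in combined])
```

A predictor that the optimal scheme set to zero still keeps a nonzero combined weight through its validity weight, which is how the combined scheme retains more scales in the battery.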

Predictor  Composite 

The  predictor  composite  was  computed  using  the 
combined  predictor  weights  described  above.  Before 
applying  the  weights,  the  predictor  scores  had  to  be 
transformed  to  a  common  metric.  Thus,  each  predictor 
was  standardized  according  to  the  pseudo-applicant 
sample. That is, a predictor’s transformed score was computed
as a z-score according to the following formula:

$$z = \frac{x - \hat{\mu}_p}{\hat{\sigma}_p} \qquad \text{[Equation 5.5.2]}$$

where $z$ = the predictor’s z-score, $x$ = the raw predictor
score, $\hat{\mu}_p$ = the predictor’s mean score in the pseudo-applicant
sample, and $\hat{\sigma}_p$ = the predictor’s standard
deviation in the pseudo-applicant sample (i.e., the estimate
of the predictor’s standard deviation in the population
based on the pseudo-applicant sample data).

The  predictor  composite  was  then  computed  by  ap¬ 
plying  the  rescaled  combined  weights  to  the  predictor  z- 
scores.  That  is,  the  predictor  composite  was  computed 
according  to  the  following  formula: 


$$\text{raw composite predictor} = \sum_{i=1}^{k} w_i z_i \qquad \text{[Equation 5.5.3]}$$

where $k$ = the number of predictors, $w_i$ = the rescaled
combined weight of the $i$th predictor, and $z_i$ = the z-score
of the $i$th predictor. In other words, the raw
composite predictor score is the weighted sum of the z-scores.
This score was rescaled such that a score of 70
represented  the  cut  score  and  100  represented  the 
maximum  possible  score.  This  is  the  scaled  AT-SAT 
battery  score.  The  determination  of  the  cut  score  is 
described  later  in  this  chapter.  To  simplify  the 
programming  of  the  software  that  would  administer  and 
score  the  AT-SAT  battery,  a  set  of  weights  was  computed 
that  could  be  applied  to  the  raw  predictor  scores  to 
obtain  the  scaled  AT-SAT  battery  score.  Thus  the  scaled 
AT-SAT  battery  score  was  computed  according  to  the 
following  formula: 

$$\text{Scaled AT-SAT Battery Score} = \sum_{i=1}^{k} w_i x_i \qquad \text{[Equation 5.5.4]}$$

where $k$ = the number of predictors, $w_i$ = the raw-score
weight of the $i$th predictor, and $x_i$ = the raw score of the
$i$th predictor.
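
The chain from raw scores to the scaled battery score can be sketched as below. The z-score and weighted-sum steps follow the formulas in this section. The rescaling step assumes a linear map that sends the raw cut score to 70 and the raw maximum to 100, since the text gives only those two anchor points. All names and numbers are hypothetical.

```python
def z_score(x, mean_p, sd_p):
    """Standardize a raw score against the pseudo-applicant mean and SD."""
    return (x - mean_p) / sd_p

def raw_composite(raw_scores, means, sds, weights):
    """Weighted sum of the predictors' z-scores."""
    return sum(
        w * z_score(x, m, s)
        for x, m, s, w in zip(raw_scores, means, sds, weights)
    )

def scale_battery_score(raw, raw_cut, raw_max):
    """Assumed linear rescaling: raw_cut maps to 70 and raw_max maps to 100."""
    return 70.0 + 30.0 * (raw - raw_cut) / (raw_max - raw_cut)

# Hypothetical two-predictor battery.
composite = raw_composite(
    raw_scores=[60.0, 15.0],
    means=[50.0, 10.0],
    sds=[10.0, 5.0],
    weights=[0.6, 0.4],
)
print(round(composite, 3))
print(round(scale_battery_score(composite, raw_cut=0.5, raw_max=2.5), 1))
```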

The  effects  of  using  various  weighting  schemes  are 
shown  in  Table  5.5.3.  The  table  shows  the  validities  both 
before  and  after  correcting  for  shrinkage  and  range 
restriction.  Because  the  regression  procedure  fits  an 
equation  to  a  specific  sample  of  participants,  a  drop  in 
the  validity  is  likely  when  the  composite  predictor  is  used 
in  the  population.  The  amount  of  the  drop  increases  as 
sample  size  decreases  or  the  number  of  predictors  in¬ 
creases.  The  correction  for  shrinkage  attempts  to  esti¬ 
mate  the  amount  of  this  drop.  The  formula  used  to 
estimate  the  validity  corrected  for  shrinkage  is  referred  to 
by  Carter  (1979)  as  Wherry  (B)  (Wherry,  1940).  The 
formula is:

$$\hat{\rho} = \sqrt{1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}} \qquad \text{[Equation 5.5.5]}$$

where $\hat{\rho}$ = the validity corrected for shrinkage, $R$ = the
uncorrected validity, $n$ = the sample size, and $k$ = the
number of predictors. Where validities were corrected
for both range restriction and shrinkage, the shrinkage
correction was performed first.
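
A minimal sketch of the shrinkage correction follows. It assumes the familiar adjusted form in which the unexplained variance (1 - R^2) is inflated by (n - 1) / (n - k - 1). The function name and the example values are hypothetical, and negative adjusted values are floored at zero as a practical convention rather than something stated in the text.

```python
import math

def wherry_shrunken_validity(r, n, k):
    """Estimate the validity after shrinkage.

    r -- uncorrected multiple correlation in the sample
    n -- sample size
    k -- number of predictors
    """
    adjusted_r_squared = 1.0 - (1.0 - r ** 2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adjusted_r_squared, 0.0))

# Hypothetical case: R = .50 with 10 predictors and 101 examinees.
print(round(wherry_shrunken_validity(0.50, n=101, k=10), 3))
```

The estimated drop grows as the sample shrinks or as predictors are added, as the paragraph above describes.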
As noted above, the final AT-SAT score was computed
using the Combined method of weighting the
predictors.  Only  the  regression  method  had  a  higher 
validity.  In  fact,  the  Combined  method  probably  has  a 
higher  validity  if  we  consider  that  its  correction  for 
shrinkage  overcorrects  to  some  extent.  Finally,  the  re¬ 
gression-weighted  validity  is  based  on  all  35  scales 
whereas  the  Combined  validity  is  based  on  just  26 
tests.  Thus,  the  Combined  weighting  method  pro¬ 
duces  the  best  validity  results. 

The  Combined  method  produced  the  second-best 
results  in  terms  of  mean  group  differences  and  fairness. 
Only the Optimal low d-score weighting method had
better  results  in  these  areas,  and  its  validity  was  much 
lower  than  the  Combined  method’s  validity.  None  of  the 
weighting  methods  produced  a  statistically  significant 
difference  in  standardized  regression  slopes  among  the 
groups.  Thus,  the  Combined  weighting  method  was  the 
best  overall.  It  had  the  highest  validity  and  the  second- 
best  results  in  terms  of  group  differences  and  fairness. 
Therefore,  the  Combined  weighting  method  was  used  to 
compute  the  final  AT-SAT  battery  score. 

Final  AT-SAT  Battery  Validity 

The  best  estimate  of  the  validity  of  the  AT-SAT 
battery  is  .76.  This  value  is  extremely  high.  Table  5.5.4 
shows  the  validity  of  the  AT-SAT  battery  for  various 
criteria.  The  table  also  shows  how  various  statistical 
corrections  affect  the  validity  estimate.  The  most  rel¬ 
evant  validity  of  .76  is  the  correlation  with  the  composite 
criterion  which  is  corrected  for  range  restriction,  shrink¬ 
age,  and  criterion  unreliability. 

The  low  sample  size  for  the  high  fidelity  criteria 
precludes  accurate  estimates  of  validity.  The  purpose  of 
the  high-fidelity  criteria  was  to  obtain  independent 
evidence  that  the  CBPM  and  Ratings  were  related  to  job 
performance.  As  shown  in  a  previous  chapter,  the  high 
correlations  of  the  CBPM  and  Ratings  with  the  high 
fidelity  criteria  are  strong  evidence  that  the  CBPM  and 
Ratings  are  accurate  indicators  of  job  performance. 

Interrater  agreement  reliability  was  used  to  correct  the 
validities  for  the  Ratings  and  HiFi  criteria.  Reliability  for 
the  CBPM  was  estimated  by  computing  its  internal 
consistency  (coefficient  alpha  =  .59),  but  this  figure  is 
probably  an  underestimate  because  the  CBPM  appears 
to  be  multidimensional  (according  to  factor  analyses). 
Ideally,  the  reliability  for  the  CBPM  should  be  com¬ 
puted  as  a  test-retest  correlation.  This  could  not  be 
computed,  however,  because  each  examinee  took  the 


CBPM  only  once.  Previous  research  has  found  that 
similar  measures  (i.e.,  situational  judgement  tests)  have 
test-retest  reliabilities  of  about  .80,  with  most  in  the 
range between .7 and .9. Thus, three different reliabilities
were  used  to  correct  the  CBPM’s  validity  for  unreliability: 
.8  (best  guess),  .9  (upper  bound  estimate),  and  .7  (lower 
bound  estimate),  respectively.  The  reliability  of  the 
composite  measure  could  not  be  directly  measured. 
Therefore,  an  approximation  of  the  composite  criterion 
reliability  was  computed  as  the  mean  of  the  ratings  and 
CBPM  reliabilities. 
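
The text does not give the formula used to correct for criterion unreliability. The sketch below assumes the standard correction for attenuation in the criterion only, which divides the validity by the square root of the criterion reliability. The observed validity of .60 and the ratings reliability of .60 are hypothetical; the three CBPM reliabilities are the ones named above.

```python
import math

def correct_for_criterion_unreliability(r, criterion_reliability):
    """Standard correction for attenuation in the criterion only (assumed)."""
    return r / math.sqrt(criterion_reliability)

def composite_reliability(ratings_reliability, cbpm_reliability):
    """Approximation described in the text: mean of the two reliabilities."""
    return (ratings_reliability + cbpm_reliability) / 2.0

# Hypothetical observed validity, corrected under the three CBPM reliabilities.
observed = 0.60
for reliability in (0.7, 0.8, 0.9):
    print(reliability, round(correct_for_criterion_unreliability(observed, reliability), 3))

# Hypothetical ratings reliability of .60 combined with the best-guess CBPM value.
print(round(composite_reliability(0.60, 0.80), 2))
```

A lower assumed reliability yields a larger corrected validity, so the lower-bound reliability of .7 gives the most generous estimate of the three.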

Determining  the  Cut  Score 

One  of  the  specifications  for  the  AT-SAT  battery  was 
that  a  score  of  70  would  represent  the  cut  score  and  a 
score  of  100  would  represent  the  highest  possible  score. 
The  cut  score  and  maximum  score  were  first  determined 
on  the  AT-SAT  battery’s  original  scale.  Then  these  two 
scores  were  transformed  to  scores  of  70  and  100  on  the 
scaled  AT-SAT  battery  scale. 

The  determination  of  the  highest  possible  score  was 
relatively straightforward. There was, however, one complication.
The maximum possible scores for the simulation
scales (i.e., Letter Factory scales, Air Traffic Scenarios
scales)  and  some  of  the  other  scales  (e.g.,  Analogies 
information  processing  scores)  were  unknown.  Thus, 
the  determination  of  the  highest  possible  score  was  not 
simply  a  matter  of  adding  up  the  maximum  scores 
possible  for  each  scale.  For  the  scales  with  an  unknown 
maximum  possible  score,  the  maximum  scores  attained 
during  the  study  were  used  to  estimate  the  highest  scores 
likely  to  be  attained  on  these  scales  in  the  future. 

The  determination  of  the  cut  score  was  more  in¬ 
volved.  The  main  goal  in  setting  the  cut  score  was  to  at 
least  maintain  the  current  level  of  job  performance  in  the 
controller  workforce.  After  examining  the  effects  of 
various  possible  cut  scores  on  controller  performance,  a 
cut  score  was  selected  that  would  slightly  improve  the  job 
performance  of  the  overall  controller  workforce.  Specifi¬ 
cally,  the  cut  score  was  set  such  that  the  mean  predicted 
criterion  score,  among  pseudo-applicants  passing  the 
battery, was at the 56th percentile of the current controller
distribution of criterion scores.

Table  5.5.5  shows  the  effects  of  this  cut  score  on 
selection  rates  and  predicted  job  performance.  If  all  the 
pseudo-applicants  were  hired,  their  mean  job  perfor¬ 
mance  would  be  at  only  the  33rd  percentile  of  the  current 
controller  distribution.  Thus,  using  the  AT-SAT  Bat¬ 
tery,  with  the  chosen  cut  score,  is  considerably  better 
than using no screening. That is, if all of the pseudo-applicants
were hired (or some were randomly selected to
be hired), their performance level would be much lower
than that of current controllers.

Impact  of  AT-SAT  on  Workforce  Capabilities 

Figure  5.5.1  shows  the  relationship  between  scores  on 
the  AT-SAT  battery  and  the  expected  or  average  perfor¬ 
mance  of  examinees  at  each  score  level.  For  comparison 
purposes, the previous OPM battery, which had a (generously
corrected) validity of about .30, has been placed
on  the  same  scale  as  the  AT-SAT  composite.  The  pri¬ 
mary  point  is  that  applicants  who  score  very  high  (at  90) 
on  the  AT-SAT  are  expected  to  perform  near  the  top  of 
the  distribution  of  current  controllers  (at  the  86th  percen¬ 
tile).  Applicants  who  score  very  high  (at  90)  on  the  OPM 
test,  however,  are  expected  to  perform  only  at  the  middle 
of  the  distribution  of  current  controllers  (at  the  50th 
percentile). Only 1 out of 147 applicants would be
expected  to  get  an  OPM  score  this  high  (90  or  above). 
Someone  with  an  OPM  score  of  100  would  be  expected 
to  perform  at  the  58th  percentile.  Consequently,  there  is 
no way that the OPM test, by itself, could be used to
select applicants much above the mean of current controllers.
In the past, of course, the OPM test was combined
with a nine-week screening program resulting in
current  controller  performance  levels.  The  AT-SAT  is 
expected  to  achieve  about  this  same  level  of  selectivity 
through  the  pre-hire  screening  alone. 

Table  5.5.6  shows  the  percent  of  high  performers 
expected  for  different  cutpoints  on  the  AT-SAT  and 
OPM  batteries.  This  same  information  is  shown  graphi¬ 
cally  in  Figure  5.5.2.  Here,  high  performance  is  defined 
as  the  upper  third  of  the  distribution  of  performance  in 
the  current  workforce  as  measured  by  our  composite 
criterion  measure.  If  all  applicants  scoring  70  or  above  on 
the  AT-SAT  are  selected,  slightly  over  one-third  would 
be  expected  to  be  high  performers.  With  slightly  greater 
selectivity, taking only applicants scoring 75.1 or above,
the proportion of high performers could be increased to
nearly half. With a cut score of 70, it should be necessary
to test about 5 applicants to find each hire. At a cut score
of 75.1, the number of applicants tested per hire goes up
to  about  10.  By  comparison,  1,376  applicants  would 
have  to  be  tested  for  each  hire  to  obtain  exactly  one-third 
high  performers  using  the  OPM  screen. 
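
The applicants-per-hire figures follow from the passing rate. The sketch below assumes that every applicant who passes is hired, so the expected number tested per hire is the reciprocal of the passing rate. The 18.8% rate is the pseudo-applicant passing rate at a cut score of 70 reported later in Chapter 5.6; the 10% rate is only what "about 10" applicants per hire would imply.

```python
def applicants_per_hire(pass_rate):
    """Expected number of applicants tested per hire.

    Assumption: every applicant who passes the battery is hired.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return 1.0 / pass_rate


tested_at_70 = applicants_per_hire(0.188)   # passing rate reported for a cut score of 70
tested_at_75_1 = applicants_per_hire(0.10)  # rate implied by "about 10" per hire
```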
CHAPTER 5.6


Analyses of Group Differences and Fairness
Gordon  Waugh,  HumRRO 


SUMMARY 

The  group  means  on  the  composite  predictor  for 
females,  blacks,  and  Hispanics  were  significantly  lower 
than  the  means  for  the  relevant  reference  groups  (males, 
whites).  The  difference  was  greatest  for  blacks.  The 
cognitive  tests  displayed  much  greater  differences  than 
did  the  EQ  scales.  However,  the  EQ  scales  had  much 
lower  validity  as  well.  Although  the  predictor  composite 
exhibited  lower  group  means  for  minorities,  no  evidence 
of  unfairness  was  found.  In  fact,  the  composite  predictor 
over-predicted  the  performance  of  all  three  minority 
groups  (females,  blacks,  and  Hispanics)  at  the  cut  score. 
The  validity  coefficients  and  regression  slopes  were  re¬ 
markably  similar  among  the  groups.  Among  the  indi¬ 
vidual  test  scales,  there  were  no  cases  (out  of  a  possible 
111)  in  which  the  slopes  of  the  regression  lines  differed 
significantly  between  a  minority  and  reference  group.  These 
results  show  that  the  test  battery  is  fair  for  all  groups. 

INTRODUCTION 

A  personnel  selection  test  may  result  in  differences 
between white and minority groups. To continue
using a test that has this result, the user must demonstrate
that the test is job-related or valid. Two types of
statistical  analyses  are  commonly  used  to  assess  this  issue. 
The  analysis  of  mean  group  differences  determines  the 
degree  to  which  test  scores  differ  for  a  minority  group  as 
a  whole  (e.g.,  females,  blacks,  Hispanics)  when  com¬ 
pared  with  its  reference  group  (i.e.,  usually  whites  or 
males).  Fairness  analysis  determines  the  extent  to  which 
the  relationship  between  test  scores  and  job  perfor¬ 
mance  differs  for  a  minority  group  compared  to  its 
reference  group. 

Our  sample  contained  enough  blacks  and  Hispan¬ 
ics  to  analyze  these  groups  separately  but  too  few 
members  of  other  minority  groups  to  include  in  the 
analyses.  It  was  decided  not  to  run  additional  analyses 
with  either  all  minorities  combined  or  with  blacks  and 
Hispanics  combined  because  the  results  differed  con¬ 
siderably  for  blacks  vs.  Hispanics.  Thus,  the  following 
pairs of comparison groups were used in the fairness
analyses: male vs. female, white vs. black, and white
vs. Hispanic. The descriptive statistics for the predictors
and criteria are shown in Tables 5.6.1-5.6.3.

Cut  Scores 

Both  the  analyses  of  sub-group  differences  and  fair¬ 
ness  required  a  cut  score  (i.e.,  a  specified  passing  score) 
for  each  test  and  for  the  predictor  composite  score. 
Therefore,  hypothetical  cut  scores  had  to  be  determined. 
The  cut  score  on  the  predictor  composite  was  set  at  the 
32nd  percentile  on  the  controller  distribution.  (This  score 
was  at  the  78th  percentile  on  the  pseudo-applicant  distri¬ 
bution.)  Thus,  the  hypothetical  cut  score  for  each  test 
was  also  set  at  the  32nd  percentile  on  the  controller 
distribution  for  the  purposes  of  the  fairness  and  group 
mean  difference  analyses.  The  determination  of  the  cut 
score  is  discussed  elsewhere  in  this  report.  Regression 
analyses  predicted  that  the  mean  level  of  job  performance 
for  applicants  passing  the  AT-SAT  battery  would  be  at 
the  56th  percentile  of  the  job  performance  of  current 
controllers.  That  is,  it  is  predicted  that  applicants  passing 
the  battery  will  perform  slightly  better  than  current 
controllers. 

Estimation  of  Missing  Values 

There were few blacks in the controller (n = 98) and
pseudo-applicant  samples  (n  =  62).  In  addition,  there 
were  even  fewer  in  the  analyses  because  of  missing  values 
on  some  tests.  When  the  composite  predictor  was  com¬ 
puted,  missing  values  on  the  individual  scales  were 
estimated.  Otherwise,  a  participant  would  have  received 
a  missing  value  on  the  composite  if  any  of  his/her  test 
scores  were  missing.  Each  missing  score  was  estimated 
using  a  regression  equation.  The  regression  used  the 
variable  with  the  missing  score  as  the  dependent  variable 
and  the  scale  that  best  predicted  the  missing  score  as  the 
independent  variable.  The  predictor  scale  had  to  be  from 
a  different  test  than  the  missing  score.  For  example,  if  an 
examinee’s  Applied  Math  score  was  missing  then  his/her 
Angles  score  was  used  to  estimate  it.  If  both  the  Applied 
Math  and  Angles  scores  were  missing,  then  the  estimated 
composite predictor score would also be missing. Each
missing  EQ  score,  however,  was  predicted  using  another 
EQ  scale.  Missing  scores  were  estimated  only  when 
building  the  composite  predictor.  That  is,  missing  values 
were  not  estimated  for  analyses  that  used  the  individual 
test scores. This was judged to be a conservative estimation
procedure because (a) only one independent variable
was used in each estimation regression, (b) none of
the  blacks  and  few  of  the  other  examinees  were  missing 
more  than  one  test  score,  and  (c)  each  test  score  contrib¬ 
uted  only  a  small  amount  to  the  final  composite  predic¬ 
tor  score.  The  amount  of  error  caused  by  the  estimation 
of  missing  values  is  very  likely  to  be  trivial.  To  ensure  that 
the  covariances  were  not  artificially  increased  by  the 
estimation  of  missing  values,  random  error  was  added  to 
each  estimated  value. 
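
The estimation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually coded for the study: the random error is drawn from a normal distribution whose standard deviation equals the regression's standard error of estimate (the report says only that random error was added), and the Angles and Applied Math values are hypothetical.

```python
import random


def fit_simple_regression(x, y):
    """Ordinary least squares with one predictor.

    Returns (intercept, slope, standard error of estimate).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    see = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return intercept, slope, see


def impute_missing(target, helper, rng):
    """Fill None entries in `target` from `helper` via regression plus noise."""
    complete = [(h, t) for h, t in zip(helper, target)
                if h is not None and t is not None]
    intercept, slope, see = fit_simple_regression(
        [h for h, _ in complete], [t for _, t in complete])
    filled = []
    for h, t in zip(helper, target):
        if t is not None:
            filled.append(t)        # observed score is kept as is
        elif h is None:
            filled.append(None)     # both scores missing: value stays missing
        else:
            # Assumed noise model: normal error with SD equal to the
            # standard error of estimate.
            filled.append(intercept + slope * h + rng.gauss(0.0, see))
    return filled


# Hypothetical scores; None marks a missing value.
angles = [10, 12, 15, 18, 20, 22, None]
applied_math = [11, 14, 15, 19, None, 24, None]
filled = impute_missing(applied_math, angles, random.Random(0))
```

Adding the noise term keeps the estimated values from lying exactly on the regression line, which would otherwise inflate the covariance between the two scales.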

GROUP  DIFFERENCES 

Analyses 

Only  the  pseudo-applicant  sample  was  used  for  the 
group  difference  analyses.  This  sample  best  represented 
the  population  of  applicants.  Therefore,  air  traffic  con¬ 
trollers  were  excluded  from  these  analyses. 

The  Uniform  Guidelines  on  Employee  Selection  Proce¬ 
dures  (Federal  Register,  1978,  Section  4.D.)  state  that 
evidence  of  adverse  impact  exists  when  the  passing  rate 
for  any  group  is  less  than  four-fifths  of  the  passing  rate  for 
the  highest  group: 

A  selection  rate  for  any  race,  sex,  or  ethnic  group  which 
is less than four-fifths (4/5) (or eighty percent) of the rate for
the  group  with  the  highest  rate  will  generally  be  regarded 
by  the  Federal  enforcement  agencies  as  evidence  of  adverse 
impact,  while  a  greater  than  four-fifths  rate  will  generally 
not  be  regarded  by  Federal  enforcement  agencies  as  evi¬ 
dence  of  adverse  impact. 

Therefore,  the  passing  rates  for  each  test  were  com¬ 
puted  for  all  five  groups  (males,  females,  whites,  blacks, 
Hispanics).  Then  the  passing  rates  among  the  groups 
were  compared  to  see  if  the  ratio  of  the  passing  rates  fell 
below  four-fifths.  Separate  comparisons  were  done  within 
the  gender  groups  and  within  the  racial  groups.  That  is, 
males  and  females  were  compared;  and  blacks  and  His¬ 
panics  were  compared  to  whites. 
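
The four-fifths comparison can be sketched in a few lines. The passing rates below are hypothetical, chosen so that the first ratio equals .54; the function names are illustrative only.

```python
def adverse_impact_ratio(focal_pass_rate, reference_pass_rate):
    """Ratio of the focal group's passing rate to the reference group's rate."""
    if reference_pass_rate <= 0.0:
        raise ValueError("reference_pass_rate must be positive")
    return focal_pass_rate / reference_pass_rate


def violates_four_fifths(focal_pass_rate, reference_pass_rate, threshold=0.8):
    """True when the ratio of passing rates falls below four-fifths."""
    return adverse_impact_ratio(focal_pass_rate, reference_pass_rate) < threshold


# Hypothetical passing rates.
ratio = adverse_impact_ratio(0.135, 0.25)
flagged = violates_four_fifths(0.135, 0.25)
not_flagged = violates_four_fifths(0.45, 0.50)
```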

The Uniform Guidelines (Section 4.D.) state that
adverse  impact  might  exist  even  if  the  passing  rate  for  the 
minority  group  is  greater  than  four-fifths  the  reference 
group’s  passing  rate: 


Smaller  differences  in  selection  rate  may  nevertheless 
constitute  adverse  impact,  where  they  are  significant  in 
both  statistical  and  practical  terms  .  .  . 

Therefore,  the  differences  in  the  passing  rates  were 
tested for statistical significance using 2 × 2 chi-square
tests of association. For each predictor score, one chi-square
analysis was done for each of the following pairs
of groups: male-female, white-black, and white-Hispanic.
An example is shown in Table 5.6.4 below, which
shows how the chi-square test comparing male and
female passing rates was computed.
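
A 2 × 2 chi-square test of this kind can be sketched as follows. The counts are hypothetical, and the sketch assumes the uncorrected Pearson statistic; the report does not say whether a continuity correction was applied. With one degree of freedom, the p-value equals erfc(sqrt(chi-square / 2)).

```python
import math


def chi_square_2x2(pass_a, fail_a, pass_b, fail_b):
    """Pearson chi-square test of association for a 2 x 2 table.

    Assumption: no continuity correction. Returns (chi_square, p_value).
    """
    n = pass_a + fail_a + pass_b + fail_b
    row_a, row_b = pass_a + fail_a, pass_b + fail_b
    col_pass, col_fail = pass_a + pass_b, fail_a + fail_b
    cells = (
        (pass_a, row_a, col_pass),
        (fail_a, row_a, col_fail),
        (pass_b, row_b, col_pass),
        (fail_b, row_b, col_fail),
    )
    chi_square = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        chi_square += (observed - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variate with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts: 60 of 200 males pass, 15 of 100 females pass.
chi_square, p_value = chi_square_2x2(60, 140, 15, 85)
```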

The groups were also compared by computing the
mean test score for each group. The differences in the
means between the minority groups and reference groups
(i.e., males or whites) were then tested for statistical
significance using independent-groups t-tests. The differences
between the means were then converted to d-scores,
which express these differences in standard
deviation units based on the reference group’s standard
deviation. For example, a d-score of -.48 for females
indicates that the mean female score is .48 standard
deviations below the mean of the male distribution of
scores (i.e., at the 32nd percentile of the male distribution
according to a table of the normal distribution).
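
The d-score and its percentile interpretation can be reproduced with a short sketch. It assumes the sample (n - 1) standard deviation of the reference group and a normal distribution of reference-group scores; the small score lists are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def d_score(focal_scores, reference_scores):
    """Mean difference in units of the reference group's standard deviation.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    return (mean(focal_scores) - mean(reference_scores)) / stdev(reference_scores)


def percentile_in_reference(d):
    """Percentile of the focal mean within a normal reference distribution."""
    return 100.0 * NormalDist().cdf(d)


# The example from the text: d = -.48 corresponds to about the 32nd percentile.
female_percentile = percentile_in_reference(-0.48)

# Hypothetical scores, for illustration only.
example_d = d_score([1, 2, 3], [2, 3, 4])
```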

Results  and  Conclusions 

Table  5.6.5  shows  the  results  for  the  passing  rate 
analyses.  Several  tests — including  the  predictor  compos¬ 
ite — exhibited  evidence  of  group  differences  for  females, 
blacks,  and  Hispanics  according  to  the  four-fifths  rule. 
In  most  of  these  cases,  the  difference  in  passing  rates  was 
statistically  significant.  Females  and  Hispanics  had  simi¬ 
lar  passing  rates;  blacks  had  by  far  the  lowest  passing 
rates. 

Table 5.6.5 also shows the differences between the
group means expressed as d-scores. The significant d-scores
are asterisked in the table. These results were very
similar to those for the passing rates. The group-predictor
combinations that had significantly lower passing
rates (compared to the reference group) also tended
to have significantly lower d-scores. All three minority
groups tended to score below their reference groups, but
the differences were often not statistically significant.
Blacks scored lowest on most tests. On the composite
predictor, Hispanics had the highest d-score, followed
by females and blacks, respectively. The Hispanic d-score
was not statistically significant.

The  group  differences  for  the  EQ  scales  were  much 
lower  than  for  the  cognitive  tests.  (The  Memory  Test  and 
the  Memory  Retest,  however,  had  very  small  group 
differences. In fact, females did better than males on
these two tests.) For example, for blacks, the median d-score
was -.48 among the 23 cognitive scores but only
-.20 among the 14 EQ scales. However, the EQ scales also
had much lower validity than did the other tests. This is
probably why the passing rates are much higher for the
EQ. In fact, the passing rates on half of the EQ scales
were  higher  for  the  pseudo-applicants  than  for  the  control¬ 
lers  (i.e.,  half  of  the  passing  rates  were  higher  than  68%, 
which  is  the  passing  rate  for  each  test  in  the  controller 
sample).  In  all  the  other  tests,  the  passing  rate  was  much 
lower  for  the  pseudo-applicants  than  for  the  controllers. 

There  are  two  possible  reasons  for  the  high  passing 
rates  for  the  EQ  scales:  (a)  the  pseudo-applicants  and 
current  controllers  possess  nearly  the  same  levels  of  the 
personality traits supposedly measured by the EQ, or (b)
the  EQ  scales  are  measuring  some  unwanted  constructs 
(probably  in  addition  to  the  traits  that  the  scales  were 
designed  to  measure).  If  the  first  possibility  is  true,  then 
one  must  conclude  that  either  these  traits  are  not  really 
needed  on  the  job  or  that  the  current  controllers  would 
perform  even  better  on  the  job  if  they  improved  in  these 
traits.  If  the  second  possibility  is  true,  then  some  un¬ 
wanted  constructs,  such  as  social  desirability,  are  being 
measured  to  some  degree  by  the  EQ  scales. 

In  conclusion,  the  predictor  composite  for  the  final 
AT-SAT battery exhibited lower scores for all three
minority groups (i.e., females, blacks, and Hispanics)
compared to their reference groups (i.e., males and
whites) in terms of both passing rates and d-scores. All of
these differences, except for the Hispanic d-score, were
statistically  significant.  The  relative  passing  rates  on  the 
predictor  composite  for  females,  blacks,  and  Hispanics 
(compared  to  the  passing  rates  for  the  reference  groups: 
males  and  whites)  were  .54,  .11,  and  .46,  respectively. 
Thus,  there  was  evidence  of  sub-group  differences  in  test 
performance  for  the  three  minority  groups. 

It  should  be  noted  that  subgroup  differences  in  pre¬ 
dictor  scores  do  not  necessarily  imply  bias  or  unfairness. 
If  low  test  scores  are  associated  with  low  criterion  perfor¬ 
mance  and  high  test  scores  are  related  to  high  criterion 
performance,  the  test  is  valid  and  fair.  The  fairness  issue 
is  discussed  below. 

FAIRNESS 

Analyses 

The fairness analyses require analyses of job performance
as well as test scores. As a consequence, all fairness
analyses were performed on the concurrent validation
controller sample. A test is considered fair when the
relationship  between  the  predictor  test  and  job  perfor¬ 
mance  is  the  same  for  all  groups.  In  our  analyses,  only 
differences  that  aid  whites  or  males  were  considered  to  be 
unfair.  Fairness  is  assessed  by  performing  regression 
analyses  using  the  test  score  as  the  independent  variable 
and  the  criterion  measure  as  the  dependent  variable.  To 
assess  the  fairness  of  a  predictor  for  females,  for  example, 
two  regressions  are  performed:  one  for  males  and  one  for 
females.  In  theory,  the  predictor  is  considered  to  be  fair 
if  the  male  and  female  regression  lines  are  identical.  In 
practice,  the  test  is  considered  to  be  fair  if  the  difference 
between  the  equations  of  the  two  regression  lines  is  not 
statistically significant (given a reasonable amount of power).

The  equations  of  the  two  regression  lines  (e.g.,  male 
vs.  female  regression  lines)  can  differ  in  their  slopes  or 
their  intercepts.  If  the  slopes  differ  significantly  then  the 
predictor  is  not  fair.  If  the  slopes  do  not  differ  signifi¬ 
cantly,  then  the  intercepts  are  examined.  In  this  study,  to 
maximize  interpretability,  the  predictor  scores  were  scaled 
such  that  all  the  intercepts  occurred  at  the  cut  point  (i.e., 
passing  score).  Specifically,  the  cut  score  was  subtracted 
from  the  predictor  score. 

Although  fairness  analysis  is  based  on  a  separate 
regression  line  for  each  of  the  two  groups  being  com¬ 
pared,  a  quicker  method  uses  a  single  regression  analysis. 
The  significance  tests  in  this  analysis  are  equivalent  to  the 
tests  that  would  be  done  using  two  lines.  In  this  analysis, 
there  is  one  dependent  variable  and  three  independent 
variables.  The  dependent  variable  is  the  criterion.  The 
independent  variables  are  shown  below: 

•  The  predictor. 

•  The  group  (a  nominal  dichotomous  variable  which 
indicates  whether  the  person  is  in  the  focal  or  reference 
group).  If  this  independent  variable  is  significant,  it 
indicates  that,  if  a  separate  regression  were  done  for  each 
of  the  two  groups,  the  intercepts  of  the  regression  lines 
would  be  significantly  different.  Because  the  predictors 
in  this  study  were  rescaled  for  these  analyses  such  that  the 
intercepts  occurred  at  the  cut  scores,  a  difference  in 
intercepts  means  that  the  two  regression  lines  are  at 
different  elevations  at  the  cut  score.  That  is,  they  have 
different  criterion  scores  at  the  predictor’s  cut  score. 

•  The  predictor  by  group  interaction  term.  This  is  the 
product of group (i.e., 0 or 1) and the predictor score. If this
independent  variable  is  significant,  it  indicates  that,  if  a 
separate  regression  were  done  for  each  of  the  two  groups, 
the  slopes  of  the  regression  lines  would  be  significantly 
different.  The  standardized  slopes  equal  the  validities. 
The regression equation is shown below:

$$\text{criterion} = b_0 + b_{\text{predictor}}\,\text{predictor} + b_{\text{group}}\,\text{group} + b_{\text{interaction}}\,\text{interaction} + \text{error} \tag{5.6.1}$$
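
The link between the single regression and the two separate regression lines can be sketched as follows. With the group coded 0 for the reference group and 1 for the focal group, the least-squares coefficients of the single equation equal the reference line's intercept and slope plus the between-group differences in intercept and slope. The sketch fits the two lines separately on predictor scores centered at the cut score and derives the four coefficients from them; it does not carry out the significance tests. The data and the cut score of 70 are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def moderated_coefficients(ref_x, ref_y, focal_x, focal_y, cut_score):
    """Coefficients of the single regression, derived from the two group lines.

    Subtracting the cut score from the predictor places each intercept
    at the cut score, as described in the text.
    """
    ref_a, ref_b = fit_line([x - cut_score for x in ref_x], ref_y)
    foc_a, foc_b = fit_line([x - cut_score for x in focal_x], focal_y)
    return {
        "b0": ref_a,                     # reference line's height at the cut score
        "b_predictor": ref_b,            # reference group's slope
        "b_group": foc_a - ref_a,        # difference in elevation at the cut score
        "b_interaction": foc_b - ref_b,  # difference in slopes
    }


# Hypothetical data: parallel lines, focal line 0.3 criterion units lower.
coefficients = moderated_coefficients(
    ref_x=[60, 70, 80, 90], ref_y=[-0.5, 0.0, 0.5, 1.0],
    focal_x=[60, 70, 80, 90], focal_y=[-0.8, -0.3, 0.2, 0.7],
    cut_score=70,
)
```

In this example the interaction coefficient is zero (equal slopes) and the group coefficient is negative, which is the pattern the text describes as over-prediction for the focal group.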


The  composite  criterion  and  the  composite  predictor 
were  used  for  the  fairness  analyses.  The  composite  crite¬ 
rion  was  the  weighted  sum  of  the  composite  rating  and 
the  CBPM.  Based  on  their  relationships  with  the  high 
fidelity  criterion  measures,  the  ratings  and  CBPM  were 
assigned  weights  of  .4  and  .6  respectively.  The  ratings  and 
CBPM  scores  were  standardized  before  they  were  added. 
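
The composite criterion can be sketched as a weighted sum of standardized scores. The weights of .4 and .6 come from the text; the use of the population standard deviation for standardizing and the small score lists are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert scores to z-scores (assumption: population standard deviation)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


def composite_criterion(ratings, cbpm, w_ratings=0.4, w_cbpm=0.6):
    """Weighted sum of the standardized ratings and CBPM scores."""
    z_ratings = standardize(ratings)
    z_cbpm = standardize(cbpm)
    return [w_ratings * r + w_cbpm * c for r, c in zip(z_ratings, z_cbpm)]


# Hypothetical ratings and CBPM scores for three controllers.
composite = composite_criterion([3, 4, 5], [50, 60, 70])
```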

RESULTS  AND  CONCLUSIONS 

Examples  of  the  fairness  regression  scatterplots  are 
shown  in  Figures  5.6.1,  5.6.2,  and  5.6.3  below.  The 
regression  lines  for  both  groups  (i.e.,  reference  and 
minority)  are  shown  in  each  plot.  The  slopes  of  the  two 
regression  lines  are  very  similar  in  each  of  the  three 
graphs.  Thus,  the  validities  differ  little  between  the 
groups  in  each  graph.  The  near-parallelism  of  the  regres¬ 
sion  lines  is  reflected  in  the  similar  values  of  the  two 
groups'  standardized  slopes  listed  in  the  graphs  and  in 
Table  5.6.6.  In  terms  of  the  intercepts,  however,  the 
white  and  male  regression  lines  are  above  the  female, 
Hispanic,  and  especially  the  black  regression  lines  at  the 
cut score. Thus, the predictor composite over-predicts
performance  for  the  three  minority  groups  compared 
with  the  reference  groups,  which  means  that  the  test 
actually  favors  the  minority  groups.  Under  these  circum¬ 
stances,  a  regression  equation  based  on  the  total  sample 
produces  predicted  job  performance  levels  that  are  higher 
than  the  actual  performance  levels  observed  for  minori¬ 
ties.  In  a  selection  situation,  minorities  would  be  favored 
in  that  they  would  achieve  a  higher  ranking  on  a  selection 
list  than  would  be  indicated  by  actual  performance. 

Table  5.6.6  shows  the  results  of  the  fairness  regres¬ 
sions  for  all  of  the  predictor  scores.  It  displays  the 
standardized  slopes  for  each  regression  line.  These  are 
equivalent  to  validity  coefficients.  The  table  also  shows 
the Regression Lines’ Difference at Cut Score (in Std. Dev.
Units). This is the difference between the intercepts
divided by the reference group’s standard error of estimate.
Thus it can be considered to be the difference
between the minority and reference groups’ predicted criterion
scores at the cut score, scaled in standard deviation
units about the regression line (see footnote 17). A negative value
indicates that the minority’s regression line was below the
reference  group’s  line. 
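
The scaled difference at the cut score can be sketched as follows. The standard error of estimate is computed from the reference group's residuals with n - 2 degrees of freedom, which is an assumption; all numeric values are hypothetical.

```python
import math


def standard_error_of_estimate(x, y, intercept, slope):
    """Residual standard deviation about a fitted line (n - 2 degrees of freedom)."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))


def difference_at_cut_in_sd_units(minority_intercept, reference_intercept,
                                  reference_see):
    """Intercept difference divided by the reference group's standard error
    of estimate. Negative values mean the minority line lies below the
    reference line at the cut score.
    """
    return (minority_intercept - reference_intercept) / reference_see


# Hypothetical reference-group data scattered around the line y = x.
see = standard_error_of_estimate([0, 1, 2, 3], [0.1, 0.9, 2.1, 2.9], 0.0, 1.0)

# Hypothetical intercepts at the cut score.
scaled_difference = difference_at_cut_in_sd_units(-0.3, 0.0, 0.6)
```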

The  table  shows  that  the  slopes  of  the  regression  lines 
are  very  similar  for  almost  all  of  the  predictors.  There  are 
no  significant  differences  in  either  the  slopes  or  inter¬ 
cepts  that  favor  the  whites  or  males,  except  for  the  EQ 
Self-Awareness  scale  whose  slope  favors  males.  There¬ 
fore,  the  test  battery  is  equally  valid  for  all  groups.  In 
addition,  the  intercepts  for  males  and  whites  are  above 
the  intercepts  for  females,  blacks  and  Hispanics  for 
every  predictor.  Thus,  there  is  no  evidence  of  unfair¬ 
ness  whatsoever. 

The  absence  of  significant  differences  between  inter¬ 
cepts  (at  the  cut  score)  in  Table  5.6.6  shows  that  the 
minority  group’s  intercept  (at  the  cut  score)  was  never 
significantly  above  the  reference  group’s  intercept.  In 
fact,  the  reverse  was  often  true.  That  is,  for  many 
predictors,  the  performance  of  the  minority  group  was 
over-predicted by the predictor score. The degree of over-prediction
was greatest for blacks and least for females.

Another  way  to  examine  fairness  is  to  see  if  the  group 
differences  are  similar  in  the  composite  predictor  and 
composite  criterion.  Table  5.6.7  shows  this  analysis. 
Although  females,  blacks,  and  Hispanics  had  lower 
scores  and  passing  rates  on  the  composite  predictor  than 
males  and  whites,  these  differences  were  virtually  identi¬ 
cal  using  the  criterion  scores.  None  of  the  discrepancies 
were  statistically  significant. 

Both  the  fairness  analyses  and  the  comparison  of  the 
group  differences  on  the  predictor  and  criterion  strongly 
support  the  fairness  of  the  final  predictor  battery  score. 
The  slopes  among  the  groups  are  very  similar  and  the 
differences  in  intercepts  always  favor  the  minority  group. 
The  group  differences  in  terms  of  passing  rates  and 
differences  in  means  are  remarkably  similar  in  the  predic¬ 
tor  compared  to  the  criterion.  The  fairness  analyses 
provide  strong  evidence  of  fairness  for  the  individual 
tests  as  well. 


17  Linear  regression  assumes  that  the  standard  deviation  of  the  criterion  scores  is  the  same  at  every  predictor  score.  This  is 
called homoscedasticity. In practice, this assumption is violated to varying degrees. Thus, in theory, the standard error of
estimate  should  equal  the  standard  deviation  of  the  criterion  scores  at  the  predictor’s  cut  score — and  at  every  other  predictor 
score  as  well.  In  practice,  this  is  only  an  approximation. 
TARGETED RECRUITMENT


The  sample  size  of  each  of  the  groups  is  an  important 
issue  in  fairness  regressions.  If  the  samples  are  too  small, 
the  analyses  will  be  unable  to  detect  statistically  signifi¬ 
cant  evidence  of  unfairness.  Figure  5.6.4  below  shows 
the  95%  confidence  intervals  for  the  slope.  The  graph 
clearly  shows  the  wide  confidence  band  for  Hispanics; 
the  moderate  bands  for  females  and  blacks;  and  the 
narrow  bands  for  males,  whites,  and  the  entire  sample. 
The  slopes  at  the  bottom  of  all  confidence  bands  are  well 
above  zero  which  shows  that  the  validity  is  statistically 
significant  for  each  group. 

The  power  analyses  were  done  to  consider  the  possi¬ 
bility  that  the  analyses  were  not  sensitive  enough  (i.e.,  the 
sample  size  was  too  small)  to  have  discovered  evidence  of 
unfairness  (see  Table  5.6.8).  From  the  fairness  regres¬ 
sions,  the  reference  groups  were  compared  with  the 
minority  groups  in  terms  of  their  slopes  and  intercepts. 
For  each  pair  of  slopes  and  intercepts,  the  analyses 
determined  how  small  the  difference  (i.e.,  a  difference 
favoring  the  reference  groups)  between  the  groups  would 
have  to  be  in  the  population  to  achieve  a  power  level  of 
80%. A power level of 80% means that, if a difference
of that size truly existed in the population and we ran the
analysis on 100 different samples, we would expect to find a
statistically significant difference between the two groups
(i.e., minority vs. reference group) in about 80 of those samples.

The  power  analyses  showed  that  even  relatively  small 
differences  between  groups  would  have  been  detected  in 
our  fairness  analyses.  Due  to  its  smaller  sample  size,  the 
Hispanic  group  has  the  largest  detectable  differences. 
Table  5.6.8  shows  the  sizes  of  the  smallest  detectable 
differences at 80% power and p < .05.
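
The idea of a smallest detectable difference can be illustrated with a normal-approximation sketch. It assumes a two-sided test and a known standard error for the difference being tested; the report does not describe the exact power computation, so this is not a reproduction of Table 5.6.8. The standard errors are hypothetical.

```python
from statistics import NormalDist


def minimum_detectable_difference(standard_error, alpha=0.05, power=0.80):
    """Smallest population difference detectable at the given power.

    Assumption: normal approximation with a two-sided test at level alpha.
    """
    z = NormalDist()
    return (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) * standard_error


# Hypothetical standard errors: a smaller group gives a larger standard error
# and therefore a larger smallest detectable difference.
detectable_large_group = minimum_detectable_difference(0.05)
detectable_small_group = minimum_detectable_difference(0.12)
multiplier = minimum_detectable_difference(1.0)
```

Under these assumptions the detectable difference is about 2.8 standard errors, which is why the group with the smallest sample has the largest detectable differences.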

DISCUSSION 

Although  many  of  the  tests,  including  the  final  AT- 
SAT  battery  score,  exhibited  differences  between 
groups,  there  is  no  reliable  evidence  that  the  battery  is 
unfair.  The  fairness  analyses  show  that  the  regression 
slopes  are  very  similar  among  the  groups  (white,  male, 
female,  black,  Hispanic).  There  are  differences  among 
the  intercepts  (at  the  cut  score),  but  these  differences 
favor  the  minority  groups.  Thus,  there  is  strong  evi¬ 
dence  that  the  battery  is  fair  for  females,  blacks,  and 
Hispanics.  These  results  show  that  the  test  battery  is 
equally  valid  for  all  comparison  groups.  In  addition, 
differences  in  mean  test  scores  are  associated  with 
corresponding  differences  in  job  performance  mea¬ 
sures.  For  all  groups,  high  test  scores  are  associated 
with  high  levels  of  job  performance  and  low  scores  are 
associated  with  lower  levels  of  job  performance. 


As indicated above, the AT-SAT Battery is equally
valid and fair for white, African American, and Hispanic groups
as well as for male and female groups. It was also shown in
Chapter  5.5  that  there  is  a  strong  positive  relationship 
between  AT-SAT  test  scores  and  job  performance  as  an 
air  traffic  controller.  At  the  same  time,  the  FAA  has  the 
responsibility  to  try  to  have  the  workforce  demographics 
reflect  the  population  of  the  nation  in  spite  of  mean  test 
score differences between groups. We believe that the
solution to the apparently contradictory goals of hiring
applicants with the highest potential for high job performance
and maintaining an employee demographic profile
that reflects the nation’s population is to staff the
ATCS positions with the use of targeted recruiting
efforts. Simply stated, targeted recruiting is the process
of  searching  for  applicants  who  have  a  higher  than 
average  probability  of  doing  well  on  the  AT-SAT  test 
battery  and,  therefore,  have  the  skills  and  abilities  re¬ 
quired  for  performance  as  an  ATCS.  For  example,  one 
recruiting  effort  might  focus  on  schools  that  attract 
students  with  high  math  ability. 

Figure  5.6.5  shows  the  distribution  of  AT-SAT  scores 
from  the  pseudo-applicant  sample,  including  scores  for 
all  sample  members,  females,  Hispanics,  and  African 
Americans.  Two  important  observations  can  be  made 
from  an  examination  of  Figure  5.6.5.  First,  there  are 
obvious  differences  in  mean  test  scores  between  the 
various  groups.  Secondly,  there  is  a  high  degree  of 
overlap  in  the  test  score  distributions  of  the  various 
groups.  This  high  degree  of  overlap  means  that  there  are 
many  individuals  from  each  of  the  different  groups  who 
score  above  the  test  cut  score.  These  are  the  individuals 
one  would  seek  in  a  targeted  recruiting  effort.  It  should 
be  noted  that  the  targeted  recruiting  effort  needs  to  be  a 
proactive  process  of  searching  for  qualified  candidates.  If 
no  proactive  recruitment  effort  is  made,  the  distribution 
of  applicants  is  likely  to  be  similar  to  that  observed  in 
Figure  5.6.5. 

On  the  other  hand,  the  potential  impact  of  targeted 
recruiting  on  mean  test  scores  is  shown  in  Table  5.6.9.  In 
the  total  applicant  sample,  18.8%  of  the  applicants 
would  likely  pass  at  the  70  cut  off.  If  applicants  from  the 
top  10%  of  the  black  population  were  recruited  so  that
they  were  6  times  more  likely  to  apply,  about  15.5%
would  be  expected  to  pass  at  the  70  cut  off.  The  change 
from  3.9%  (no  targeted  recruiting)  to  15.5%  (with 
targeted  recruiting)  represents  an  increase  of  about  300% 
in  the  black  pass  rate. 
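
The arithmetic behind these figures can be sketched as follows. This is only an illustration, not the procedure used to build Table 5.6.9. It assumes that every applicant who would pass at the 70 cut off falls within the top 10% of the group (plausible when the base pass rate is 3.9%), and that targeted recruiting multiplies the application rate of that top 10% by 6 while leaving everyone else unchanged. The function names are hypothetical.

```python
def targeted_pass_rate(base_pass_rate, top_fraction, multiplier):
    """Pass rate after the top `top_fraction` of a group becomes
    `multiplier` times more likely to apply.

    Assumes every passing applicant lies inside the top fraction,
    so base_pass_rate must not exceed top_fraction.
    """
    if base_pass_rate > top_fraction:
        raise ValueError("assumption violated: passers must lie in the top fraction")
    # Reweighted share of applicants who pass.
    passers = multiplier * base_pass_rate
    # Reweighted applicant pool: boosted top fraction plus everyone else.
    pool = multiplier * top_fraction + (1.0 - top_fraction)
    return passers / pool


def relative_increase(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


rate = targeted_pass_rate(base_pass_rate=0.039, top_fraction=0.10, multiplier=6)
print(f"pass rate with targeted recruiting: {rate:.1%}")
print(f"relative increase, 3.9% to 15.5%: {relative_increase(3.9, 15.5):.0%}")
```

Under these assumptions the sketch gives a pass rate of about 15.6%, close to the 15.5% cited above, and the change from 3.9% to 15.5% is a relative increase of roughly 297%, which the text rounds to about 300%.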

CHAPTER  6


The  Relationship  of  FAA  Archival  Data  to  AT-SAT  Predictor  and  Criterion  Measures 

Carol  A.  Manning  and  Michael  C.  Heil 
Federal  Aviation  Administration,  Civil  Aeromedical  Institute 


The  FAA  Civil  Aeromedical  Institute  (CAMI)  has 
conducted  research  in  the  area  of  air  traffic  controller 
selection  and  training  for  nearly  3  decades.  As  a  result  of 
this  research,  CAMI  established  several  Air  Traffic  Con¬ 
trol  Specialist  (ATCS)  data  bases  that  contain  selection 
and  training  scores,  ratings,  and  measures  as  well  as 
demographic  information  and  other  indices  of  career 
progression.  The  archival  data  described  below  were 
matched  with  AT-SAT  predictor  test  and  criterion  per¬ 
formance  scores  for  controllers  participating  in  the  con¬ 
current  validation  study  who  agreed  to  have  their  historical 
data  retrieved  and  linked  with  the  experimental  selection 
and  performance  data. 

PREVIOUS  ATC  SELECTION  TESTS 

The  United  States  ATCS  selection  process  between 
1981  and  1992  consisted  of  two  testing  phases:  (a)  a  4 
hour  written  aptitude  examination  administered  by  the 
United  States  Office  of  Personnel  Management  (OPM); 
and  (b)  a  multi-week  screening  program  administered  by 
the  FAA  Academy.  A  description  of  these  tests  is  pre¬ 
sented  below. 

OPM  Test  Battery 

The  OPM  test  battery  included  the  Multiplex  Con¬ 
troller  Aptitude  Test,  the  Abstract  Reasoning  Test,  and 
the  Occupational  Knowledge  Test.  The  Multiplex  Con¬ 
troller  Aptitude  Test  (MCAT)  required  the  applicant  to 
combine  visually  presented  information  about  the  posi¬ 
tions  and  direction  of  flight  of  several  aircraft  with 
tabular  data  about  their  altitude  and  speed.  The  applicant’s 
task  was  to  decide  whether  pairs  of  aircraft  would  con¬ 
flict  by  examining  the  information  to  answer  the  ques¬ 
tions.  Other  items  required  computing  time-distance 
functions,  interpreting  information,  and  spatial  orienta¬ 
tion.  Performance  on  the  MCAT  was  reported  as  a  single 
score.  The  Abstract  Reasoning  Test  (ABSR)  was  a  civil 
service  examination  (OPM-157)  that  included  ques¬ 
tions  about  logical  relationships  between  either  symbols 
or  letters.  This  was  the  only  test  retained  from  the 
previous  Civil  Service  Commission  (CSC)  battery  in  use
before  1981.  (The  other  CSC  tests  were  Computations,
Spatial  Patterns,  Following  Oral  Directions,  and  a  test
that  slightly  resembled  the  MCAT.)  The  Occupational
Knowledge  Test  was  a  job  knowledge  test  that  contained
items  related  to  air  traffic  control  phraseology  and  pro¬ 
cedures.  The  purpose  of  using  the  Occupational  Knowl¬ 
edge  Test  was  to  provide  candidates  with  extra  credit  for 
demonstrated  job  knowledge. 

The  MCAT  comprised  80%  of  the  initial  qualifying 
score  for  the  OPM  battery,  while  the  Abstract  Reasoning 
Test  comprised  20%.  After  these  weights  were  applied  to 
the  raw  scores  for  each  test,  the  resulting  score  was 
transmuted  to  a  distribution  with  a  mean  of  70  and  a 
maximum  score  of  100.  If  the  resulting  Transmuted 
Composite  score  (TMC)  was  less  than  70,  the  applicant 
was  eliminated  from  further  consideration.  If,  however, 
the  applicant  earned  a  TMC  of  70  or  above,  he  or  she 
could  receive  up  to  15  extra  credit  points  (up  to  a 
maximum  score  of  100)  based  upon  the  score  earned  on 
the  Occupational  Knowledge  Test  (OKT).  Up  to  10 
extra  credit  points  (up  to  a  maximum  score  of  110)  could
also  be  added  based  on  Veteran’s  Preference.  The  sum  of 
the  TMC  and  all  earned  extra  credit  points  was  the  final 
OPM  Rating. 
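
The scoring rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the OPM's actual scoring routine: the transmutation of weighted raw scores is not described in enough detail to reproduce, so the TMC is taken as an input, and the function name and argument names are hypothetical. The two caps are interpreted as applying in sequence (100 after OKT credit, 110 after Veteran's Preference).

```python
def opm_rating(tmc, okt_credit=0.0, veterans_credit=0.0):
    """Return the final OPM Rating, or None if the applicant is eliminated.

    tmc             -- Transmuted Composite score (transmutation not modeled here)
    okt_credit      -- extra credit from the Occupational Knowledge Test (0 to 15)
    veterans_credit -- Veteran's Preference points (0 to 10)
    """
    if not 0 <= okt_credit <= 15:
        raise ValueError("OKT extra credit must be between 0 and 15")
    if not 0 <= veterans_credit <= 10:
        raise ValueError("Veteran's Preference credit must be between 0 and 10")
    if tmc < 70:
        return None  # below the qualifying score: no further consideration
    # OKT credit cannot raise the score above 100.
    with_okt = min(tmc + okt_credit, 100)
    # Veteran's Preference cannot raise the score above 110.
    return min(with_okt + veterans_credit, 110)


print(opm_rating(65, okt_credit=15, veterans_credit=10))
print(opm_rating(75, okt_credit=5))
print(opm_rating(92, okt_credit=15, veterans_credit=10))
```

Note that extra credit is considered only after the applicant qualifies on the TMC, so a high OKT score cannot rescue a TMC below 70.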

This  version  of  the  OPM  ATCS  battery  was  imple¬ 
mented  in  September  1981,  just  after  the  Air  Traffic 
Controller  strike.  For  some  time  after  the  strike,  appli¬ 
cants  were  selected  using  either  a  score  on  the  earlier  CSC 
battery  or  on  the  later  OPM  battery.  Because  of  concerns 
about  artificial  increases  in  test  scores  as  a  function  of 
training,  changes  were  made  in  October  1985  to  1) 
replace  the  versions  of  the  MCAT  that  were  used,  2) 
change  the  procedures  used  to  administer  the  MCAT, 
and  3)  change  eligibility  requirements  for  re-testing. 

Academy  Nonradar  Screening  Programs

Because  tens  of  thousands  of  people  applied  for  the 
job  of  Air  Traffic  Control  Specialist  (ATCS),  it  was 
necessary  to  use  a  paper-and-pencil  format  to  administer 
the  CSC/OPM  batteries.  With  paper-and-pencil  test¬ 
ing,  it  was  difficult  to  measure  aptitudes  that  would  be 
utilized  in  a  dynamic  environment.  Consequently,  there
continued  to  be  a  high  attrition  rate  in  ATCS  field
training  even  for  candidates  who  successfully  completed 
the  initial  selection  process  (earning  a  qualifying  score  on 
the  CSC/OPM  selection  battery,  and  passing  both  a 
medical  examination  and  a  background  investigation).
In  1975,  the  Committee  on  Government  Operations 
authorized  the  FAA  Academy  to  develop  and  administer 
a  second-stage  selection  procedure  to  “provide  early  and 
continued  screening  to  insure  prompt  elimination  of 
unsuccessful  trainees  and  relieve  the  regional  facilities  of 
much  of  this  burden.” 

In  January  of  1976,  two  programs  were  introduced  at
the  FAA  Academy  to  evaluate  students’  ability  to  apply 
a  set  of  procedures  in  an  appropriate  manner  for  the  non¬ 
radar  control  of  air  traffic.  From  1976  until  1985, 
candidates  entered  either  the  12-week  En  Route  Initial 
Qualification  Training  program  (designed  for  new  hires 
assigned  to  en  route  facilities)  or  the  16-week  Terminal 
Initial  Qualification  Training  program  (designed  for 
new  hires  assigned  to  terminal  facilities).  While  both 
programs  were  based  on  non-radar  air  traffic  control, 
they  used  different  procedures  and  were  applied  in 
different  types  of  airspace.  Academy  entrants  were  as¬ 
signed  to  one  program  or  the  other  on  a  more-or-less 
random  basis  (i.e.,  no  information  about  their  aptitude, 
as  measured  by  the  CSC/OPM  rating,  was  used  to  assign 
them  to  an  “option”  or  facility).  Those  who  successfully 
completed  one  of  the  programs  went  on  to  a  facility  in 
the  corresponding  option.  Those  who  did  not  success¬ 
fully  complete  one  of  the  programs  were  separated  from 
the  GS-2152  job  series.

Both  the  En  Route  and  Terminal  Screen  programs 
contained  academic  tests,  laboratory  problems,  and  a 
Controller  Skills  Test.  The  laboratory  problems,  each 
one-half  hour  in  length,  required  the  student  to  apply  the 
principles  of  non-radar  air  traffic  control  learned  during 
the  academic  portions  of  the  course  to  situations  in 
which  simulated  aircraft  moved  through  a  synthetic 
airspace.  Student  performance  was  evaluated  by  certified 
air  traffic  control  instructors.  Two  scores,  a  Technical 
Assessment  (based  on  observable  errors  made)  and  an 
Instructor  Assessment  (based  on  the  instructor’s  rating 
of  the  student’s  potential)  were  assigned  by  the  grading 
instructor  for  each  problem.  These  assessment  scores 
were  then  averaged  to  yield  an  overall  laboratory  score  for 
a  single  problem. 

The  Controller  Skills  Test  (CST)  measured  the  appli¬ 
cation  of  air  traffic  control  principles  to  resolve  air  traffic 
situations  in  a  speeded  paper-and-pencil  testing  situation.
The  composite  score  in  the  program  was  based  on
a  weighted  sum  of  the  Block  Average  (BA;  the  average  of
scores  from  the  academic  block  tests),  the  Comprehen¬ 
sive  Phase  Test  (CPT;  a  comprehensive  test  covering  all 
academic  material),  the  Lab  Average  (the  average  score 
on  the  best  5  of  6  graded  laboratory  problems),  and  the 
Controller  Skills  Test  (CST).  A  composite  grade  of  70 
was  required  to  pass.  From  1976  until  1985,  the  same 
weights  were  applied  to  the  program  components  of  both 
the  En  Route  and  Terminal  Screen  programs  to  yield  the 
overall  composite  score:  2%  for  the  Block  Average,  8% 
for  the  Comprehensive  Phase  Test,  65%  for  the  Lab 
Average,  and  25%  for  the  CST. 

For  those  candidates  entering  the  Academy  after  the 
Air  Traffic  Controller  strike  of  1981,  the  pass  rate  in  the
En  Route  Screen  program  was  52.3%  and  the  pass  rate 
in  the  Terminal  Screen  program  was  67.8%.  The  pass 
rate  in  both  programs  combined  was  58.0%.  In  October 
of  1985,  the  two  programs  were  combined  to  create  the 
Nonradar  Screen  program.  The  purpose  of  using  a  single 
program  was  to  allow  facility  assignments  to  be  based, 
when  possible,  upon  the  final  grade  earned  in  the  pro¬ 
gram.  The  Nonradar  Screen  program  was  based  upon  the 
En  Route  screen  program  (containing  the  same  lessons 
and  comparable  tests  and  laboratory  problems).  It  was 
necessary  to  change  the  weights  applied  to  the  individual 
component  scores  of  the  Nonradar  Screen  program  to 
maintain  the  average  pass  rate  obtained  for  both  the  En 
Route  and  Terminal  screen  programs.  The  weights  used 
in  the  Nonradar  Screen  program  to  yield  the  overall 
composite  score  were:  8%  for  the  Block  Average,  12% 
for  the  Comprehensive  Phase  Test,  60%  for  the  Lab 
Average,  and  20%  for  the  CST.  The  pass  rate  for  the 
Nonradar  Screen  program  was  56.6%. 
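
The two weighting schemes described above can be compared with a short sketch. The component names, the 0-100 score scale, and the example student are assumptions made for illustration. The code also assumes that the passing composite of 70 applied to the Nonradar Screen program as it did to the earlier programs.

```python
WEIGHTS_1976_1985 = {"block_average": 0.02, "cpt": 0.08, "lab_average": 0.65, "cst": 0.25}
WEIGHTS_NONRADAR_SCREEN = {"block_average": 0.08, "cpt": 0.12, "lab_average": 0.60, "cst": 0.20}
PASSING_COMPOSITE = 70


def lab_average(lab_scores):
    """Average of the best 5 of 6 graded laboratory problem scores."""
    if len(lab_scores) != 6:
        raise ValueError("expected scores for 6 graded laboratory problems")
    best_five = sorted(lab_scores, reverse=True)[:5]
    return sum(best_five) / 5


def composite_score(components, weights):
    """Weighted sum of the four program components."""
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical student; all scores are assumed to be on a 0-100 scale.
student = {
    "block_average": 90.0,
    "cpt": 85.0,
    "lab_average": lab_average([70, 60, 65, 75, 50, 70]),
    "cst": 70.0,
}

for label, weights in [("1976-1985", WEIGHTS_1976_1985),
                       ("Nonradar Screen", WEIGHTS_NONRADAR_SCREEN)]:
    score = composite_score(student, weights)
    verdict = "pass" if score >= PASSING_COMPOSITE else "fail"
    print(f"{label}: composite = {score:.1f} ({verdict})")
```

Because the Nonradar Screen weights shift some emphasis from the laboratory problems to the academic tests, a student with strong academic scores and a marginal lab average earns a somewhat higher composite under the later scheme.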

The  Pre-Training  Screen 

In  1992,  the  Nonradar  Screen  program  was  replaced 
with  the  Pre-Training  Screen  (PTS)  as  the  second-stage 
selection  procedure  for  air  traffic  controllers.  The  goals 
of  using  the  PTS  were  to  1)  reduce  the  costs  of  ATCS 
selection  (by  reducing  the  time  required  for  screening 
controllers  from  approximately  9  weeks  to  5  days),  2) 
maintain  the  validity  of  the  ATCS  selection  system,  and 
3)  support  agency  cultural  diversity  goals.  The  PTS 
consisted  of  the  following  tests:  Static  Vector/Continu¬ 
ous  Memory,  Time  Wall/Pattern  Recognition,  and  Air 
Traffic  Scenarios  Test.  Broach  &  Brecht-Clark  (1994) 
conducted  a  predictive  validity  study  using  the  final 
score  in  the  ATCS  screen  as  the  criterion  measure.  They 
found  that  the  PTS  added  20%  to  the  percentage  of 
variance  explained  in  the  Nonradar  Screen  Program  final 
score,  over  and  above  the  contribution  made  by  the
OPM  test.  Broach  &  Brecht-Clark  (1994)  also  described 
a  concurrent  validation  study  conducted  using  297 
developmental  and  Full  Performance  Level  (FPL)  con¬ 
trollers.  The  criterion  used  for  this  study  was  a  composite 
of  supervisor  ratings  and  times  to  complete  field  train¬ 
ing,  along  with  performance  in  the  Radar  Training 
program.  The  corrected  multiple  correlation  between 
PTS  final  score  and  the  training  composite  score  was  .25 
as  compared  with  .19,  which  was  the  multiple  correla¬ 
tion  between  the  ATCS  screen  score  and  the  training 
composite. 

Radar  Training  (Phase  XA) 

A  second  screening  program,  the  En  Route  Basic 
Radar  Training  Course  (otherwise  known  as  RTF),  was 
administered  to  en  route  developmentals  who  had  completed
their  Radar  Associate/Nonradar  on-the-job  training.
The  RTF  course  was  a  pass/fail  course,  and
developmentals  who  did  not  pass  were  unable  to  proceed
in  further  radar  training  at  their  facilities  unless  they 
recycled  and  later  passed  the  course.  However,  the  pass 
rate  in  this  phase  of  training  exceeded  98%.  The  RTF 
course  paralleled  the  Nonradar  Screen  program,  includ¬ 
ing  an  average  grade  on  block  tests  (2%  of  the  final 
grade),  a  comprehensive  phase  test  (8%  of  the  final 
grade),  an  average  grade  for  laboratory  evaluations  (65% 
of  the  final  grade),  and  a  Controller  Skills  Test  (25%  of 
the  final  grade).

OTHER  ARCHIVAL  DATA  OBTAINED 
FOR  ATC  CANDIDATES 

Biographical  Questionnaire 

Additional  information  about  controller  demograph¬ 
ics  and  experience  was  obtained  from  data  provided  by 
Academy  entrants  during  the  first  week  they  attended 
one  of  the  Academy  screening  programs  and  obtained 
from  the  Consolidated  Personnel  Management  Infor¬ 
mation  System  (CPMIS).  New  entrants  completed  a 
Biographical  Questionnaire  (BQ).  Different  BQ  items 
were  used  for  those  entering  the  Nonradar  Screen  Pro¬ 
gram  at  various  times.  The  BQ  questions  concerned  the 
amount  and  type  of  classes  taken,  grades  earned  in  high 
school,  amount  and  type  of  prior  air  traffic  and/or  aviation 
experience,  reason  for  applying  for  the  job,  expectations 
about  the  job,  and  relaxation  techniques  used. 


VanDeventer  (1983)  found  that  the  biographical 
question  related  to  grades  in  high  school  mathematics 
courses  loaded  .31  on  a  factor  defined  by  pass/fail  status 
in  the  Academy  screening  program.  Taylor,  VanDeventer,
Collins,  &  Boone  (1983)  found  that,  for  a  group  of  1980
candidates,  younger  age,  higher  grades  in  high  school
math  and  biology,  pre-FAA  ATC  experience,  fewer
repetitions  of  the  CSC  test,  and  a  self-assessment  of
performance  in  the  top  10%  of  all  controllers  were
related  to  an  increased  probability  of  passing  the  Nonradar
Screen  program.  Collins,  Manning,  &  Taylor  (1984)
found  that,  for  a  group  of  trainees  entering  the  Academy
between  1981  and  1983,  the  following  were  positively
related  to  pass/fail  status  in  the  Nonradar  Screen  program:
higher  grades  in  high  school  math,  physical  science,  and
biology  classes;  a  higher  overall  high  school  grade  point
average;  younger  age;  not  being  a  member  of  the  armed
forces;  taking  the  OPM  test  only  one  time;  expectations  of
staying  in  ATC  work  more  than  3  years;  and  a  self-assessment
that  the  trainee's  performance  would  be  in  the  top  10%  of
all  ATCSs.  Collins,  Nye,  &  Manning  (1990)  found,  for
a  group  of  Academy  entrants  between  October  1985  and
September  1987,  that  higher  mathematics  grades  in  high
school,  higher  overall  high  school  grade  point  average,
self-assessment  that  less  time  will  be  required  to  be
effective  as  an  ATCS,  self-assessment  that  the  trainee’s
performance  level  will  be  in  the  top  10%  of  all  ATCSs,
and  having  taken  the  OPM  test  fewer  times  were  related
to  pass/fail  status  in  the  Academy  screening  program.

16PF  and  Experimental  Tests 

Also  available  were  scores  from  the  Sixteen  Personal¬ 
ity  Factor  (16PF),  which  is  administered  during  the 
medical  examination  and  scored  with  a  revised  key 
(Cattell  &  Eber,  1962;  Convey,  1984;  Schroeder  &
Dollar,  1997).  Other  tests  and  assessments  were  admin¬ 
istered  during  the  first  week  of  the  Academy  screening 
programs;  however,  they  were  often  administered  to  a 
limited  number  of  classes.  Consequently,  these  tests 
would  have  been  taken  by  only  a  few  of  the  controllers 
who  passed  the  Academy,  became  certified  in  an  en  route 
facility,  and  eventually  participated  in  the  concurrent 
validation  study.  Only  the  Mathematics  Aptitude  Test 
was  taken  by  a  sufficient  number  of  participants  to 
include  in  these  analyses. 

ARCHIVAL  CRITERION  MEASURES

Field  Training  Performance  Measures  as  Criteria 

Description  of  En  Route  ATCS  Field  Training 

In  the  en  route  option,  the  unit  of  air  traffic  control 
operation  is  the  sector,  a  piece  of  airspace  for  which  a 
team  of  2-3  controllers  is  responsible  (during  times  of 
slow  traffic,  only  one  controller  may  be  responsible  for  a 
sector).  A  group  of  between  5-8  sectors  is  combined  into 
what  is  called  an  area  of  specialization.  An  en  route 
controller  is  assigned  to  only  one  area  of  specialization, 
but  is  responsible  for  controlling  traffic  for  all  sectors 
within  that  area.  The  team  of  en  route  controllers  work¬ 
ing  at  a  sector  handles  duties  related  to:  Radar  separation 
of  aircraft  (radar  duties;  including  formulating  clear¬ 
ances  to  ensure  separation  and  delivering  them  by  radio 
to  pilots,  handing  off  responsibility  for  an  aircraft  to 
another  controller);  assisting  the  radar  controller  (radar 
associate  duties;  including  maintaining  records  about 
clearances  that  have  been  issued  or  other  changes  in  the 
flight  plan  of  an  aircraft,  identifying  potential  problems, 
communicating  information  not  directly  related  to  aircraft
separation  to  pilots  or  other  controllers);
or  supporting  other  activities  (assistant  controller  duties; 
including  entering  data  into  the  computer,  ensuring  that 
all  records  of  flight  progress  are  available  for  the  control¬ 
ler  in  charge). 

En  route  controllers  are  usually  trained  as  assistant 
controllers  first,  then  given  training  on  increasingly 
difficult  responsibilities  (radar  associate  duties,  then 
radar).  Training  on  concepts  is  conducted  in  the  class¬ 
room,  before  being  applied  in  a  laboratory  setting,  and 
then  reinforced  during  on-the-job  training  (OJT),  which 
is  conducted  in  a  supervised  setting.  At  some  facilities,  all 
radar  associate  training  is  completed  before  radar  train¬ 
ing  begins.  At  other  facilities,  training  is  conducted  by 
position:  Both  radar  associate  and  radar  training  are 
provided  for  a  specific  position  before  training  begins  on 
the  next  position.  At  one  point  in  time,  en  route  controllers 
could  have  taken  up  to  9  phases  of  field  training,  depending 
on  the  way  training  was  provided  at  the  facility. 

Measures  of  Performance  in  Field  Training 

Several  measures  of  training  performance  were  ob¬ 
tained  for  each  phase  of  air  traffic  control  field  training. 
For  each  phase  of  training,  the  start  and  completion 
dates,  the  number  of  hours  used  to  complete  on-the-job 
training  (OJT),  the  grade  (Pass,  Fail,  or  Withdraw),  and 
a  rating  of  controller  potential,  measured  on  a  6-point
scale  (provided  by  an  instructor  or  supervisor  who  most
frequently  observed  the  student  during  that  phase)  were 
collected.  This  information  was  compiled  to  derive 
measures  of  training  performance,  such  as  the  amount  of 
time  (in  years)  required  to  reach  full  performance  level 
(FPL)  status,  mean  instructor  ratings  of  potential  com¬ 
puted  for  OJT  phases  (called  the  Indication  of  Perfor¬ 
mance),  the  amount  of  time  (in  calendar  days)  required 
to  complete  OJT  in  certain  training  phases,  and  the  total 
number  of  OJT  hours  required  to  complete  those  phases. 
Data  were  used  from  only  phases  IX  and  XII  because 
those  phases  included  the  first  two  sectors  on  which 
nonradar/radar  associate  (Phase  IX)  and  radar  (Phase 
XII)  training  were  provided. 
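
The derivation of these summary measures from the per-phase records can be sketched as follows. The record layout, field names, and example values are hypothetical; the source describes which items were collected but not how they were stored.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class PhaseRecord:
    phase: str             # e.g. "IX" or "XII"
    start: date
    completion: date
    ojt_hours: float
    grade: str             # "Pass", "Fail", or "Withdraw"
    potential_rating: int  # instructor rating on the 6-point scale


def days_in_phase(record):
    """Calendar days between the start and completion of a phase."""
    return (record.completion - record.start).days


def indication_of_performance(records):
    """Mean instructor rating of potential across the given OJT phases."""
    return mean(r.potential_rating for r in records)


def total_ojt_hours(records):
    """Total OJT hours used across the given phases."""
    return sum(r.ojt_hours for r in records)


# Hypothetical records for one trainee (values invented for illustration).
records = [
    PhaseRecord("IX", date(1986, 3, 3), date(1986, 9, 1), 210.0, "Pass", 4),
    PhaseRecord("XII", date(1987, 1, 5), date(1987, 8, 3), 260.0, "Pass", 5),
]

for r in records:
    print(f"Phase {r.phase}: {days_in_phase(r)} days, {r.ojt_hours} OJT hours")
print("Indication of Performance:", indication_of_performance(records))
print("Total OJT hours:", total_ojt_hours(records))
```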

These  measures  of  training  performance  were  col¬ 
lected  because  they  were  readily  available  for  most  train¬ 
ees,  but  a  number  of  outside  factors  besides  aptitude  and 
technical  proficiency  could  have  affected  their  value. 
Time  required  to  reach  FPL  status  could  be  affected  by 
delays  in  training  caused  by  a  number  of  factors,  includ¬ 
ing  the  need  for  management  to  use  a  trainee  to  control 
traffic  on  sectors  on  which  he/she  had  already  certified 
instead  of  allowing  him/her  to  participate  in  OJT,  the 
number  of  other  students  undergoing  OJT  in  the  same 
airspace  at  the  same  time  (limiting  an  individual’s  access 
to  OJT),  or  the  number  of  trainees  (affecting  the  availability
of  the  training  simulation  laboratory).  The  number
of  OJT  hours  required  to  certify  on  a  specific  sector
could  be  affected  by  the  type  of  traffic  the  student 
controlled  during  training  or  the  difficulty  of  the  sector. 
The  subjective  rating  of  trainee  potential  could  be  af¬ 
fected  by  a  number  of  rating  biases  familiar  to  psycholo¬ 
gists,  such  as  halo,  leniency,  etc.  In  spite  of  the 
measurement  problems  associated  with  these  training 
performance  measures,  they  were  the  best  measures 
available  for  many  years  to  describe  performance  in 
ATCS  technical  training  programs. 

HISTORICAL  STUDIES  OF  VALIDITY  OF 
ARCHIVAL  MEASURES 

Brokaw  (1984)  reviewed  several  studies  examining 
the  relationship  between  aptitude  tests  and  performance 
in  both  air  traffic  control  training  and  on  the  job.  He 
described  an  early  study  (Taylor,  1952)  that  identified  a 
set  of  9  tests  having  zero-order  correlations  of  .2  or  above 
with  supervisor  job  performance  ratings  or  composite 
criteria.  A  selection  battery  that  included  the  following 
tests  was  recommended  but  not  implemented:  Memory 
for  Flight  Information,  Air  Traffic  Problems  I  &  II,
Flight  Location,  Coding  Flight  Data  I,  Memory  for 
Aircraft  Position,  Circling  Aircraft,  Aircraft  Position, 
and  Flight  Paths. 

A  more  extensive  study  was  performed  during  a  joint 
Air  Force  Personnel  Laboratory  and  Civil  Aeronautics 
Administration  collaboration  (Brokaw,  1957).  Thirty- 
seven  tests  were  administered  to  130  trainees  in  an  ATC
school.  Criteria  were  based  on  performance  in  the  ATC 
course,  including  grades  for  the  lecture,  instructor  rat¬ 
ings,  and  a  composite  of  ratings  from  multiple  instruc¬ 
tors.  Tests  related  to  one  or  more  of  the  training  criteria 
involved  Computational  and  Abstract  Reasoning  (in¬ 
cluding  Dial  and  Table  Reading  and  Arithmetic  Reason¬ 
ing  tests),  Perceptual  and  Abstract  Reasoning,  Verbal 
Tests,  Perceptual  Speed  and  Accuracy,  and  Tempera¬ 
ment.  The  multiple  correlation  of  four  tests  (Air  Traffic 
Problems,  Arithmetic  Reasoning,  Symbol  Reasoning 
and  Perceptual  Speed,  and  Code  Translation)  with  the 
instructor  rating  was  .51. 

A  follow-up  study  (Brokaw,  1959)  was  conducted  to 
examine  the  relationship  between  the  experimental  se¬ 
lection  battery  and  supervisor  ratings  of  on-the-job 
performance.  The  multiple  correlation  of  the  same  four 
tests  with  the  supervisor  rating  was  .34.  Trites  (1961) 
conducted  a  second  follow-up  study  using  Brokaw’s
1957  sample,  obtaining  supervisor  ratings  after  hire. 
Symbolic  Reasoning  and  Perceptual  Speed,  Abstract 
Reasoning  (DAT),  Space  Relations  (DAT),  and  Spatial 
Orientation  (AFOQT),  were  all  significantly  related  to 
supervisor  ratings  provided  in  1961  (correlations  were 
.21,  .18,  .18,  and  .23,  respectively).  The  correlations
were  reduced  somewhat  when  partial  correlations  were 
computed  holding  age  constant.  Furthermore,  the  Fam¬ 
ily  Relations  Scale  from  the  California  Test  Bureau 
(CTB)  California  Test  of  Personality  had  a  .21  correla¬ 
tion  with  the  1961  supervisor  ratings.  The  correlation 
was  not  reduced  by  partialing  out  the  effect  of  age. 
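
Holding age constant corresponds to a first-order partial correlation. The sketch below shows the standard formula. The .21 test-rating correlation is taken from the text, but the two correlations with age are hypothetical values chosen only to illustrate how partialing can reduce a correlation; they are not reported in the source.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, holding z constant."""
    denominator = sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator


r_test_rating = 0.21   # from the text
r_test_age = -0.20     # hypothetical
r_rating_age = -0.25   # hypothetical

adjusted = partial_correlation(r_test_rating, r_test_age, r_rating_age)
print(f"zero-order r = {r_test_rating:.2f}, partial r (age held constant) = {adjusted:.2f}")
```

When age is unrelated to either variable, the numerator is unchanged, which is consistent with the Family Relations Scale correlation not being reduced by partialing out age.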

Trites  &  Cobb  (1963),  using  another  sample,  found 
that  experience  in  ATC  predicted  performance  both  in 
ATC  training  and  on  the  job.  However,  aptitude  tests 
were  better  predictors  of  performance  in  training  than 
was  experience.  Five  aptitude  tests  (DAT  Space  Rela¬ 
tions,  DAT  Numerical  Ability,  DAT  Abstract  Reason¬ 
ing,  CTMM  Analogies,  and  Air  Traffic  Problems)  had 
correlations  of  .34,  .36,  .45,  .28,  and  .37  with  academic 
and  laboratory  grades,  while  the  correlations  with  super¬ 
visor  ratings  were  lower  (.04,  .09,  .12,  .13,  and  .15, 
respectively)  for  en  route  controllers. 


Other  studies  have  examined  relationships  between 
experimental  tests  and  performance  in  the  FAA  Academy 
Screening  Program.  Cobb  &  Mathews  (1972)  developed 
the  Directional  Headings  Test  (DHT)  to  measure  speeded
perceptual-discrimination  and  coding  skills.  They  found 
that  the  DHT  correlated  .41  with  a  measure  of  training 
performance  for  a  group  of  air  traffic  control  trainees 
who  had  already  been  selected  using  the  CSC  selection 
battery.  However,  the  test  was  highly  speeded,  and  was 
consequently  difficult  to  administer. 

Boone  (1979),  in  a  study  using  1828  ATC  trainees, 
found  that  the  Dial  Reading  Test  (DRT;  developed  at 
Lackland  AFB  for  selecting  pilot  trainees)  and  the  DHT 
had  correlations  of  .27  and  .23,  respectively,  with  the 
standardized  laboratory  score  in  the  Academy  screen 
program.  An  experimental  version  of  the  MCAT  corre¬ 
lated  .28  with  the  lab  score.  In  the  same  study,  CSC  24 
(Computations)  and  CSC  157  (Abstract  Reasoning) 
correlated  .10  and  .07,  respectively,  with  the  laboratory 
score. 

Schroeder,  Dollar  &  Nye  (1990)  administered  the 
DHT  and  DRT  to  a  group  of  1126  ATC  trainees  after
the  air  traffic  control  strike  of  1981.  They  found  that  the
DHT  correlated  .26  (.47  after  adjustment  for  restriction 
in  range)  with  the  final  score  in  the  Academy  screening 
program,  while  the  DRT  correlated  .29  (.52  after  adjust¬ 
ment  for  restriction  in  range)  with  the  final  score  in  the 
Academy  screening  program.  MCAT  correlated  .17  and 
Abstract  Reasoning  correlated  .16  with  the  final  score, 
though  those  two  tests  had  been  used  to  select  the  trainees. 
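
An adjustment for restriction in range can be sketched with the Thorndike Case 2 formula for direct selection on the predictor. The source does not state which correction was applied, so this is an illustration of one common approach rather than a reconstruction of the authors' computation. The observed correlations are from the text; the standard deviation ratio of 2.0 is a hypothetical value.

```python
from math import sqrt


def correct_for_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case 2 correction for direct range restriction.

    sd_ratio is the predictor's standard deviation in the unrestricted
    (applicant) group divided by its standard deviation in the restricted
    (selected) group.
    """
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    r = r_restricted
    return (r * sd_ratio) / sqrt(1.0 - r ** 2 + (r ** 2) * (sd_ratio ** 2))


# Observed correlations are from the text; the SD ratio is hypothetical.
for name, observed in [("DHT", 0.26), ("DRT", 0.29)]:
    corrected = correct_for_range_restriction(observed, sd_ratio=2.0)
    print(f"{name}: observed r = {observed:.2f}, corrected r = {corrected:.2f}")
```

With a ratio of 1.0 the correction leaves the correlation unchanged. A ratio near 2 happens to give values close to the adjusted correlations reported above, but that agreement should not be read as evidence of the method or ratio actually used.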

Manning,  Della  Rocco,  and  Bryant  (1989)  found
statistically  significant  (though  somewhat  small)  corre¬ 
lations  between  the  OPM  component  scores  and  mea¬ 
sures  of  training  status,  instructor  ratings  of  trainee 
potential,  and  time  to  reach  FPL  (a  negative  correlation) 
for  1981-1985  graduates  of  the  en  route  Academy  screen¬ 
ing  program.  Correlations  (not  corrected  for  restriction 
in  range)  of  the  MCAT  with  training  status,  OJT  hours 
in  Phase  IX,  mean  Indication  of  Performance  for  Phases 
VIII-X,  OJT  hours  in  Phase  XII,  Indication  of  Performance
in  Phases  XI-XIII,  and  time  to  FPL  were  -.12,  .05,
.11,  .08,  .11,  and  -.11,  respectively.  Correlations  (not
corrected  for  restriction  in  range)  of  the  Abstract  Reasoning
Test  with  the  same  measures  of  field  training  performance
were  .03,  .04,  .03,  .09,  .01,  and  -.02,  respectively.

Manning  et  al.  also  examined  correlations  between 
component  scores  in  the  en  route  Academy  screening 
program  and  the  same  measures  of  field  training  perfor¬ 
mance.  Correlations  (not  corrected  for  restriction  in 
range)  of  the  Lab  Average  with  training  status,  OJT
hours  in  Phase  IX,  Indication  of  Performance  in  Phases 
VIII-X,  OJT  hours  in  Phase  XII,  Indication  of  Perfor¬ 
mance  in  Phase  XII,  and  Time  to  FPL  were  -.24,  -.06, 
.23,  -.12,  .24,  and  -.16,  respectively.  Correlations  (not 
corrected  for  restriction  in  range)  of  the  Nonradar  Con¬ 
troller  Skills  Test  with  the  same  training  performance 
measures  were  -.08,  -.02,  .11,  0,  .07,  and  -.09.  Correlations
(not  corrected  for  restriction  in  range)  of  the
Final  Score  in  the  Screen  with  the  same  training 
performance  measures  were  -.24,  -.06,  .24,  -.10,  .24, 
and  -.18,  respectively. 

Manning  (1991)  examined  the  same  relationships  for 
FY-96  graduates  of  the  ATC  screen  program,  assigned  to 
the  en  route  option.  Correlations  (not  corrected  for 
restriction  in  range)  of  the  MCAT,  Abstract  Reasoning 
Test,  and  OPM  rating  with  status  in  field  training  were 
.09,  .03,  and  .09,  respectively.  When  adjusted  for  restric¬ 
tion  in  range,  these  correlations  were  .24,  .04,  and  .35, 
respectively.  Correlations  (not  corrected  for  restriction 
in  range)  of  the  Lab  Average,  Controller  Skills  Test,  and 
Final  Score  in  the  Screen  with  status  in  field  training  were 
.21,  .16,  and  .24,  respectively.  When  adjusted  for  restriction
in  range,  these  correlations  were  .36,  .26,  and  .44,
respectively. 

RELATIONSHIPS  BETWEEN  ARCHIVAL 
DATA  AND  AT-SAT  MEASURES 

Relationship  of  Archival  and  AT-SAT  Criterion 
Measures 

It  is  expected  that  the  measures  of  field  training 
performance  used  during  the  1980s  as  criterion  measures 
to  assess  the  validity  of  the  OPM  test  and  Academy 
screening  programs  will  also  be  significantly  correlated 
with  the  AT-SAT  criterion  measures.  The  magnitude  of 
these  correlations  might  be  lower  than  those  computed 
among  the  original  archival  measures  because  several 
years  have  elapsed  between  the  time  when  field  training 
occurred  and  the  administration  of  the  AT-SAT  crite¬ 
rion  measures. 

Table  6.1  shows  correlations  between  the  archival 
criterion  measures  and  the  AT-SAT  criterion  measures. 
These  correlations  have  not  been  adjusted  for  restriction 
in  the  range  of  the  training  performance  measures. 
Correlations  between  days  and  hours  in  the  same  phase 
of  training  were  high,  and  correlations  between  days  and 
hours  in  different  phases  of  training  were  moderate. 
Correlations  between  the  Indication  of  Performance  and 
time  in  the  same  or  different  phases  of  training  were
non-significant,  but  the  correlation  between  the  Indication  of
Performance  in  Phase  IX  and  the  Indication  of  Perfor¬ 
mance  in  Phase  XII  was  moderately  high. 

Correlations  between  time  in  training  phases  and  the 
composite  criterion  rating  were  statistically  significant  at 
the  .01  level,  but  were  not  very  high.  The  CBPM  was 
significantly  correlated  with  only  the  days  and  hours  in 
Phase  XII,  which  described  the  outcome  of  training  on 
the  first  two  radar  sectors.  It  makes  sense  that  the  CBPM 
would  relate  particularly  to  performance  in  radar  train¬ 
ing  because  the  CBPM  contains  items  based  on  radar 
concepts. Correlations of both the ratings and the CBPM
with the Indication of Performance variables were either
non-significant or not in the expected direction (i.e.,
correlations of AT-SAT criteria with the Indication of
Performance variables should be positive, while correlations
with training times should be negative).

Relationship  of  Archival  Predictors  with  Archival 
and  AT-SAT  Criterion  Measures 

Because  the  archival  and  AT-SAT  criterion  measures 
are  related,  and  because  the  ATCS  job  has  changed  little 
in the last 15 years, the selection procedures previously
used  by  the  FAA  and  the  AT-SAT  criterion  measures 
should  be  correlated.  The  following  two  tables  show 
relationships  of  the  OPM  rating  and  performance  in  the 
Academy  screen  program  with  both  the  archival  and  AT- 
SAT  criterion  measures.  It  should  be  remembered  that 
the controllers who participated in the concurrent validation
study were doubly screened: first on the basis of
their OPM rating, then on the basis of their score
in the Academy Screen program. Current FPLs were also
reduced  in  number  because  some  failed  to  complete 
training  successfully.  Thus,  there  is  considerable  restric¬ 
tion  in  the  range  of  the  selection  test  scores. 
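
The adjustment for restriction in range reported in these tables can be illustrated with the standard correction for direct selection on the predictor (often called Thorndike's Case 2). The report does not state which formula was applied, so the following is only a minimal sketch under that assumption; the function name and the example standard deviations are hypothetical.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct selection on the predictor.

    r_restricted    -- correlation observed in the selected (restricted) group
    sd_unrestricted -- predictor standard deviation among all applicants
    sd_restricted   -- predictor standard deviation among those selected
    """
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted  # > 1 when selection narrowed the range
    r2 = r_restricted ** 2
    return (r_restricted * u) / math.sqrt(1.0 - r2 + r2 * u ** 2)


# Hypothetical illustration: an observed r of .30 where selection halved the SD.
corrected = correct_for_range_restriction(0.30, sd_unrestricted=10.0, sd_restricted=5.0)
print(round(corrected, 2))
```

Under this formula the adjusted value exceeds the observed one only when the selected group is less variable than the applicant group; if the selected group's standard deviation is the larger of the two, the same formula lowers the correlation.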

Table  6.2  shows  correlations  of  the  archival  selection 
test  scores  (OPM  Rating,  final  score  in  the  Nonradar 
Screen  program,  and  final  score  in  the  Radar  Training 
program)  with  both  the  archival  criterion  measures  and 
the  AT-SAT  criterion  measures.  Correlations  adjusted 
for  restriction  in  the  range  of  the  predictors  are  in 
parentheses  after  the  restricted  correlations.  The  OPM 
rating correlated .18 with the final score in the Nonradar
Screen program and .11 with the final score in the Radar
Training program. The OPM rating had very low correlations
with archival criterion measures (although it was
significantly correlated with the Indication of Performance
in initial radar training). The OPM rating was
not significantly correlated with the rating composite,
but was significantly correlated with the CBPM score
(r = .22). The final score in the Nonradar Screen program
was significantly correlated with training times in
both  phases  of  field  training  and  with  time  to  reach  FPL 
status,  but  not  with  either  Indication  of  Performance 
measure.  The  final  score  in  the  Nonradar  Screen  pro¬ 
gram  was  also  significantly  correlated  with  both  AT- 
SAT  criterion  measures,  although  the  correlation  with 
the  CBPM  (.34)  was  much  higher  than  the  correlation 
with  the  rating  composite  (.12).  The  final  score  in  the 
Radar Training program was also significantly correlated
with  training  times,  and  was  significantly  correlated  with 
the  Indication  of  Performance  for  initial  radar  training. 
It  was  also  significantly  correlated  with  both  the  AT- 
SAT  rating  composite  (.17)  and  the  CBPM  score  (.21). 

Table  6.3  shows  correlations  of  the  performance- 
based  components  of  the  archival  selection  procedures 
(Nonradar  Screen  program  and  Radar  Training  pro¬ 
gram)  with  both  the  archival  and  AT-SAT  criterion 
measures.  The  correlations  at  the  top  of  the  table  are 
intercorrelations  between  archival  selection  procedure 
components.  Of  the  OPM  component  scores,  only  the 
Abstract  Reasoning  Test  and  the  MCAT  were  signifi¬ 
cantly  correlated. 

Correlations  of  components  of  the  OPM  battery  with 
component  scores  from  the  Nonradar  Screen  program 
and  the  Radar  Training  program  were  fairly  low,  al¬ 
though  some  statistically  significant  correlations  with 
scores  from  the  laboratory  phases  were  observed.  The 
MCAT  was  significantly  correlated  with  Instructor  As¬ 
sessment  and  Technical  Assessment  from  both  the 
Nonradar  Screen  and  Radar  Training  programs,  and  was 
significantly  correlated  with  the  Nonradar  CST.  Ab¬ 
stract  Reasoning  was  significantly  correlated  with  only 
the  nonradar  Average  Technical  Assessment  and  the 
nonradar  CST.  The  OKT  had  a  small  but  statistically 
significant  correlation  with  the  Nonradar  Average  In¬ 
structor  Assessment. 

The  correlation  between  the  Average  Instructor  As¬ 
sessment  and  Average  Technical  Assessment  from  each 
course  was  very  high  (.79  and  .83,  for  the  Nonradar 
Screen  program  and  Radar  Training  program,  respec¬ 
tively.)  Across  programs  the  Average  Instructor  Assess¬ 
ment  and  Average  Technical  Assessment  had  significant 
correlations  that  ranged  between  about  .02  and  .35.  The 
Controller  Skills  Tests  for  both  courses  had  significant 
correlations  with  the  Nonradar  Average  Technical  and 
Average  Instructor  Assessment.  While  the  Nonradar 
CST  was  significantly  correlated  with  the  Radar  Average 
Instructor and Technical Assessments, the Radar CST
was not. Correlation between CSTs was only .25, which
was  similar  to  correlations  with  other  components  of  the 
Nonradar  Screen  and  Radar  Training  programs. 

Correlations  of  OPM  component  scores  with  the 
rating  criterion  measure  were  all  low  and  non-signifi¬ 
cant.  However,  the  MCAT  and  Occupational  Knowl¬ 
edge  Tests  were  both  significantly  correlated  with  the 
CBPM  score. 

Of  the  components  of  the  Nonradar  Screen  and 
Radar Training programs, the Average Technical Assessment
had significant negative correlations with training
times  (though  not  with  the  Indication  of  Performance 
measures).  The  Radar  Technical  Assessment  was  corre¬ 
lated  both  with  time  spent  in  Radar  Associate  and  Radar 
field  training  phases,  while  the  Nonradar  Technical 
Assessment  was  only  correlated  with  time  spent  in  Radar 
field  training  phases.  Both  were  significantly  correlated 
with the time required to reach FPL status. The Radar
Average  Instructor  Assessment  was  significantly  corre¬ 
lated  with  time  spent  in  Radar  Associate  field  training. 
Interestingly,  the  Nonradar  Average  Instructor  Assess¬ 
ment  was  not  related  to  time  in  phases  of  field  training, 
although  its  correlation  with  the  Nonradar  Average 
Technical  Assessment  was  about  .8.  Both  the  Nonradar 
and  Radar  Average  Instructor  Assessment  were  signifi¬ 
cantly  correlated  with  time  to  reach  FPL  status. 

The  Nonradar  and  Radar  Average  Technical  Assess¬ 
ments  and  Average  Instructor  Assessments  were  all  sig¬ 
nificantly  related  to  the  CBPM  score,  though  only  the 
Nonradar  Average  Instructor  Assessment  was  signifi¬ 
cantly  related  to  the  rating  composite.  Both  the  Nonradar 
and  Radar  Controller  Skills  Tests  were  significantly 
correlated  with  the  CBPM.  This  relationship  is  not 
surprising  because  the  CSTs  and  CBPM  have  similar 
formats:  They  all  present  a  sample  air  traffic  situation 
and  ask  the  respondent  to  answer  a  multiple  choice 
question  (under  time  pressure)  involving  the  application 
of  ATC  procedures.  The  CSTs  were  presented  in  a 
paper-and-pencil format while the CBPM was presented
using  a  dynamic  computer  display. 

Relationship  of  Archival  Criteria  and  High-Fidelity 
Simulation  Criteria 

Table  6.4  shows  correlations  of  the  criterion  measures 
obtained  from  the  high-fidelity  simulation  (comprising 
107 participants) with archival performance-based predictor
and archival criterion measures. The high-fidelity
criterion measures used in this analysis included the
individual scales used in the Over-the-Shoulder rating
form. Also used was the number of operational errors
made  during  the  7th  graded  scenario,  the  most  complex 
scenario  included  in  the  simulation  test.  The  high- 
fidelity  rating  scales  were  correlated  very  highly  with 
each  other  (.80  and  above).  The  number  of  operational 
errors  made  in  the  7th  graded  scenario  was  correlated  -.20 
to  -.32  with  the  high  fidelity  rating  scales,  which  were 
based  on  performance  in  all  7  graded  scenarios.  The 
high-fidelity  rating  scales  (based  on  assessments  of  maxi¬ 
mum  performance)  had  correlations  of  about  .35  to 
about  .40  with  the  AT-SAT  rating  composite  (based  on 
assessments  of  typical  performance),  and  had  correla¬ 
tions  of  about  .60  to  .65  with  the  CBPM.  The  number 
of  operational  errors  made  in  the  7th  graded  scenario  was 
not  significantly  correlated  with  either  the  AT-SAT 
rating  composite  or  the  CBPM. 

The  high-fidelity  rating  scales  were  not  correlated 
with  either  Indication  of  Performance  measure  obtained 
from  field  training  records.  OJT  hours  in  Phase  IX 
(Radar  Associate/Nonradar  training)  had  significant 
negative  correlations  with  several  individual  high-fidel¬ 
ity  rating  scales,  including  the  overall  rating.  OJT  hours 
in Phase XII (field Radar training) had significant negative
correlations with all high-fidelity rating scales except
Coordination. Time to reach FPL status had
significant  negative  correlations  with  only  Maintaining 
efficient  air  traffic  flow  and  with  Attention  &  Situation 
Awareness. 

The high-fidelity rating scales had higher, significant
correlations with some of the performance-based components
of the archival selection procedures. The high-
fidelity  rating  scales  were  correlated  between  about  .35 
and  .40  with  the  Average  Instructor  Assessment  from  the 
Nonradar  Screen  program,  and  were  correlated  between 
about  .5  and  .55  with  the  Average  Technical  Assessment 
from  the  Nonradar  Screen  program.  There  were  only 
two  significant  correlations  between  the  Controller  Skills 
Test  from  the  Nonradar  Screen  program  and  the  high- 
fidelity  rating  scales  (Coordination  and  Managing  Sec¬ 
tor  Workload).  The  high-fidelity  rating  scales  had  almost 
no  correlation  with  the  Average  Instructor  Assessment 
from  the  Radar  screen  program  but  were  correlated 
between  about  .55  and  .60  with  the  Average  Technical 
Assessment  from  the  Radar  screen  program.  Perfor¬ 
mance  on  the  Controller  Skills  Test  from  the  Radar 
screen  program  was  correlated  between  about  .60  and 
.71  with  the  high-fidelity  rating  scales.  Though  many  of 
these  correlations  are  statistically  significant,  they  were 
typically based on fewer than 60 participants who allowed
their archival data to be matched with their
performance in the AT-SAT testing and the high-fidelity
simulation  testing.  At  the  same  time,  it  is  interesting  to 
observe  correlations  of  the  magnitude  seen  here  between 
measures  of  performance  from  simulations  that  occurred 
recently and from performance-based selection procedures
that occurred between 5 and 15 years previously.
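
The caution about sample size can be made concrete with an approximate confidence interval for a correlation. The sketch below uses Fisher's z-transformation, a standard large-sample method that is not described in the report itself; the r value of .35 and the comparison sample of 600 are illustrative assumptions only.

```python
import math


def fisher_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)                # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)      # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Illustrative values only: the same r with a small and a large sample.
for n in (60, 600):
    low, high = fisher_confidence_interval(0.35, n)
    print(n, round(low, 2), round(high, 2))
```

With about 60 cases the interval around a moderate correlation is wide, which is why the individual coefficients in Table 6.4 are best read as rough indications of magnitude.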

Relationship  of  Archival  Measures  and  AT-SAT 
Predictors 

It  was  also  expected  that  archival  measures,  including 
archival  selection  tests  and  scores  on  experimental  tests 
administered  at  the  FAA  Academy  during  the  first  week 
of the Academy screen program, might have high correlations
with AT-SAT predictor tests. High correlations
between  AT-SAT  predictors  and  other  aptitude  tests 
should  provide  evidence  supporting  interpretations  of 
the  construct  validity  of  the  AT-SAT  tests.  The  magni¬ 
tude  of  these  correlations  might  be  reduced,  however, 
because  the  experimental  tests  were  administered  be¬ 
tween  5  and  15  years  prior  to  the  concurrent  validity 
study  and  the  OPM  test  was  probably  administered 
between  6  and  16  years  previously. 

An analysis was conducted to compute correlations
between the AT-SAT predictor tests and scores on the
OPM selection tests: the Multiplex Controller Aptitude
Test (MCAT), the Abstract Reasoning Test, and the
Occupational Knowledge Test (OKT). The MCAT,
the  highest  weighted  component  of  the  OPM  rating, 
required  integrating  air  traffic  information  to  make 
decisions  about  relationships  between  aircraft.  Thus, 
aptitudes  required  to  perform  well  on  the  MCAT  might 
be  related  to  aptitudes  required  to  perform  well  on  the 
Air  Traffic  Scenarios  Test  (ATST).  Furthermore,  the 
skills  required  to  integrate  information  when  taking  the 
MCAT  might  be  related  to  performance  on  the  Letter 
Factories  Test,  Time  Wall,  Scan,  and  Planes  tests.  Posi¬ 
tive  correlations  of  the  AT-SAT  predictors  with  the 
MCAT,  a  test  previously  used  to  select  controllers, 
would  provide  further  evidence  of  the  validity  of  the  tests 
included  in  the  AT-SAT  battery. 

Table  6.5  shows  correlations  of  the  MCAT,  Abstract 
Reasoning  Test,  and  OKT  with  the  AT-SAT  predictor 
tests.  The  computed  correlations  are  followed  in  paren¬ 
theses  by  correlations  adjusted  for  restriction  in  the  range 
of  each  archival  selection  test.  (Correlations  for  the  OKT 
were  not  adjusted  for  restriction  in  range  because  the 
standard  deviation  of  the  OKT  after  candidates  were 
selected  was  larger  than  was  its  standard  deviation  before 
applicants were selected.)

MCAT had significant, but small, correlations with
many of the AT-SAT predictor tests: all measures derived
from the Letter Factories test, Applied Math, Time
Wall  Time  Estimation  Accuracy  and  Perceptual  Accu¬ 
racy  scores  (but  not  Perceptual  Speed),  Air  Traffic  Sce¬ 
narios  Efficiency  and  Safety  scores  (but  not  Procedural 
Accuracy),  Analogies  Reasoning  score  (but  not  Latency 
or  Information  Processing),  Dials,  Scan,  both  Memory 
tests,  Digit  Span,  Planes  Timesharing  score  (but  not 
Projection  or  Dynamic  Visual/Spatial),  and  Angles. 

Abstract  Reasoning  was  also  significantly  correlated 
with  several  of  the  AT-SAT  predictor  tests.  The  relation¬ 
ship  of  the  most  interest  is  with  the  component  scores  of 
the  Analogies  test.  Abstract  Reasoning  might  be  expected 
to  have  a  high  correlation  with  Analogies  because  many 
items  in  both  tests  are  similar.  Thus,  it  is  not  surprising 
to  observe  a  correlation  of  .33  between  Abstract  Reason¬ 
ing  and  the  Analogies:  Reasoning  score.  However,  the 
correlation  of  Abstract  Reasoning  with  the  Latency  and 
Information  Processing  components  was  non-signifi¬ 
cant.  Abstract  Reasoning  also  correlated  with  other  AT- 
SAT  predictor  tests:  all  Letter  Factories  subscores,  Angles, 
Applied  Math,  Time  Wall:  Time  Estimation  Accuracy 
and  Perceptual  Accuracy  (but  not  Perceptual  Speed), 
both  Memory  tests,  Dials,  Scan,  and  AT  Scenarios: 
Efficiency  and  Safety  (but  not  Procedural  Accuracy). 

The  Occupational  Knowledge  Test  measured  the 
knowledge  about  aviation  and  air  traffic  control  that 
applicants  brought  to  the  job.  The  OKT  had  several 
significant  correlations  with  AT-SAT  predictor  tests, 
although all but one were negative, implying that controllers
who entered the occupation with less knowledge of
ATC  performed  better  on  the  AT-SAT  aptitude  tests. 
OKT  was  negatively  correlated  with  Letter  Factories 
Situational  Awareness  and  Planning  &  Thinking  ahead 
scores  (but  was  not  significantly  correlated  with  number 
of  letters  correctly  placed),  both  memory  tests,  Time 
Wall  Perceptual  Accuracy  score,  and  Planes  Dynamic 
Visual/Spatial  score.  OKT  had  a  significant  positive 
correlation  with  AT  Scenarios  Procedural  Accuracy  score. 

Although  many  of  these  correlations  are  statistically 
significant,  they  are  nevertheless  small,  which  might 
appear  to  suggest  that  they  do  not  provide  evidence  of  the 
construct  validity  of  the  AT-SAT  predictor  tests.  More¬ 
over,  most  of  the  correlations  continued  to  be  rather 
small  after  they  were  adjusted  for  restriction  in  the  range 
of  the  archival  selection  tests.  However,  it  must  be 
remembered  that  the  participants  in  the  concurrent 
validity  study  were  doubly  (and  even  triply)  selected, 
because they first qualified on the basis of their performance
on the OPM test, then by passing the Nonradar
Screen  program  (which  had  approximately  a  40%  loss 
rate),  then  again  by  passing  field  training  (which  had 
approximately  an  additional  10%  loss  rate).  Thus,  even 
making  one  adjustment  for  restriction  in  range  does  not 
compensate  for  all  the  range  restriction  that  occurred. 
Furthermore,  performance  on  the  AT-SAT  predictor 
tests  may  have  been  influenced  by  age-related  effects. 
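
The effect of successive selection hurdles can be explored with a small simulation. The sketch below is a toy model, not an analysis from this study: the shared-ability weights, the 50% pass rate on the entry test, and the use of the Case 2 correction formula are all assumptions made for illustration. Only the approximate 40% and 10% loss rates come from the text.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sd(xs):
    """Sample standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def keep_top(rows, column, fraction):
    """Keep the given fraction of rows with the highest value in one column."""
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return ranked[: int(len(ranked) * fraction)]


def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        # Each measure = shared ability + independent noise (assumed weights).
        # Columns: 0 entry test, 1 screen score, 2 field training, 3 job criterion.
        rows.append(tuple(0.7 * ability + 0.7 * rng.gauss(0, 1) for _m in range(4)))
    population_r = pearson([r[0] for r in rows], [r[3] for r in rows])

    survivors = keep_top(rows, 0, 0.50)       # assumed pass rate on the entry test
    survivors = keep_top(survivors, 1, 0.60)  # about 40% lost in the screen
    survivors = keep_top(survivors, 2, 0.90)  # about 10% more lost in field training
    restricted_r = pearson([r[0] for r in survivors], [r[3] for r in survivors])

    # Single Case 2 adjustment using the entry-test standard deviations only.
    u = sd([r[0] for r in rows]) / sd([r[0] for r in survivors])
    r2 = restricted_r ** 2
    corrected_r = restricted_r * u / math.sqrt(1 - r2 + r2 * u ** 2)
    return population_r, restricted_r, corrected_r


population_r, restricted_r, corrected_r = simulate()
print(round(population_r, 2), round(restricted_r, 2), round(corrected_r, 2))
```

The three printed values allow the correlation among all simulated applicants, the correlation among survivors, and the singly adjusted correlation to be compared. The specific numbers depend entirely on the assumed parameters and should not be read as estimates for the AT-SAT sample.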

Archival  Experimental  Tests  and  AT-SAT  Predic¬ 
tors.  The  next  analysis  examined  the  relationship  of  the 
Dial  Reading  Test  (DRT),  the  Directional  Headings 
Test  (DHT),  and  two  other  archival  measures  of  math¬ 
ematical  aptitude  with  AT-SAT  predictor  tests.  The  Dial 
Reading  Test  is  a  paper-and-pencil  version  of  the  com¬ 
puterized  AT-SAT  Dials  test,  and  so  it  would  be  ex¬ 
pected  that  scores  would  be  highly  correlated.  The  DHT 
was  an  experimental  test  administered  to  ATC  trainees 
during the 1970s. The DHT required comparing three
pieces of information: a letter (N, S, E, or W), a symbol
(^, v, <, or >), and a number (0 to 360 degrees), all
indicating  direction,  in  order  to  determine  whether  they 
indicated  a  consistent  or  inconsistent  directional  head¬ 
ing.  A  second  part  of  the  test  required  determining  the 
opposite  of  the  indicated  direction.  Thus,  performance 
on  the  DHT  might  be  expected  to  correlate  positively 
with  both  Angles  and  Applied  Math. 

The  Math  Aptitude  Test  was  taken  from  the  Educa¬ 
tional  Testing  Service  (ETS)  Factor  Reference  Battery 
(Ekstrom,  French,  Harman,  &  Derman,  1976).  An  item 
dealing  with  reported  grades  in  high  school  math  courses 
was  also  included  in  the  analysis  because  this  biographi¬ 
cal  information  was  previously  found  to  be  related  to 
success  in  the  Nonradar  Screen  program. 

Although  these  tests  were  administered  between  5  and 
15  years  before  the  concurrent  validation  study,  it  is 
expected  that  the  DHT  and  DRT  would  be  at  least 
moderately  related  to  performance  on  some  of  the  AT- 
SAT  predictor  tests,  especially  those  related  to  math¬ 
ematical  skills.  It  may  be  remembered  that  in  past 
research,  the  DHT  and  DRT  had  moderate  correlations 
with criterion measures of performance in ATC training.
Thus,  positive  correlations  of  the  AT-SAT  predictors 
with  the  DHT  and  DRT  would  provide  further  evidence 
of  the  validity  of  the  AT-SAT  tests. 

Table  6.6  shows  the  relationship  of  three  AT-SAT 
predictor  tests  with  DHT,  DRT,  the  Math  Aptitude 
Test,  and  a  biographical  item  dealing  with  high  school 
math  grades.  Numbers  of  respondents  are  shown  in 
parentheses  after  the  correlation  coefficient.  As  expected, 
Applied Math had a high, positive correlation with the
Math Aptitude Test total score (.63). Applied Math also
had statistically significant and reasonably high positive
correlations with Dial Reading Number Correct (.52)
and Directional Headings Number Correct Part 2 (.40).
Applied Math also had moderate, significant negative
correlations with Dial Reading Number Wrong
(-.36) and the biographical item dealing with high school
math grades (-.34).

Angles  was  significantly  correlated  with  Dial  Reading 
Number  Correct  (.37)  and  Dial  Reading  Number  Wrong 
(-.28).  Angles  was  also  significantly  correlated  with  the 
Math  Aptitude  Test  (.41)  and  the  biographical  item 
dealing with high school math grades (-.21). Unexpectedly,
Angles had a small positive (but significant) correlation
with Directional Headings Number Wrong Part 2 (.18).

The results of the comparison of the Dials test and the
archival experimental tests were somewhat surprising.
Dials had a significant positive correlation with Dial
Reading number correct (.22) and a significant negative
correlation with Dial Reading number wrong (-.39).
The correlation with Dial Reading number correct was
low, however, given that Applied Math and Angles each
correlated more highly with that score than Dials did.
One consideration is that Dials did not contain all the
same items as Dial Reading.
After  the  Alpha  testing,  certain  items  present  in  Dial 
Reading  were  removed  from  Dials,  and  other  items  were 
inserted.  Moreover,  Dial  Reading  was  presented  in  a 
paper-and-pencil  format  while  Dials  was  presented  in  a 
computerized  format.  One  might  speculate  that  the 
different formats were responsible for the reduced correlation.
However, it must be remembered that the Dial
Reading Test was administered between 5 and 15 years
prior  to  the  administration  of  Dials,  and  considerable 
training  and  aging  occurred  during  the  interim.  While 
air  traffic  controllers  in  the  en  route  environment  may 
not  read  dials,  they  are  trained  extensively  on  other  tasks 
involving  perceptual  speed  and  accuracy,  which  is  an 
aptitude  that  the  Dials  test  is  likely  to  measure.  Thus,  it 
is  more  likely  that  the  low  correlation  between  Dial 
Reading  and  Dials  resulted  from  changes  in  the  items, 
and  the  effects  of  time  and  aging  on  the  individuals 
taking  the  test,  rather  than  a  change  in  the  format  of  the  test. 

Pre-Training  Screen  and  AT-SAT  Predictors.  In 
1991, a group of 297 developmental and FPL controllers
participated  in  a  study  assessing  the  validity  of  the  Pre- 
Training  Screen  (Broach  &  Brecht-Clark,  1994).  Sixty- 
one  controllers  who  participated  in  the  concurrent 
validation  of  the  PTS  also  participated  in  the  AT-SAT 
concurrent  validation  in  1997/1998. 


Scoring  algorithms  used  for  the  PTS  version  of  the 
ATST  differed  from  those  used  for  the  AT-SAT  version 
of  the  ATST.  In  the  PTS  version,  the  Safety  score  was  a 
count  of  safety-related  errors  and  Delay  Time  measured 
the  amount  of  time  aircraft  were  delayed.  For  both  the 
Safety  score  and  Total  Delay  Time,  higher  scores  indi¬ 
cated  worse  performance.  In  the  AT-SAT  version,  the 
Safety  and  Efficiency  scores  were  based  on  counts  of 
errors  and  measurement  of  delays,  but  both  variables 
were  transformed  so  that  higher  scores  indicated  better 
performance.  Procedural  Accuracy  is  a  new  variable 
based  on  the  occurrence  of  errors  not  related  to  safety.  It 
is  expected  that  the  PTS  Safety  Score  would  be  more 
highly  correlated  with  the  AT-SAT  Safety  score  than 
with  the  AT-SAT  Efficiency  Score  and  that  PTS  Total 
Delay  Time  would  be  more  highly  correlated  with  the 
AT-SAT  Efficiency  Score  than  with  the  AT-SAT  Safety 
Score. It is also expected that the two PTS scores would
have significant negative correlations with the three AT-SAT
scores.
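
The expected negative sign follows directly from the opposite scoring directions. The sketch below uses made-up scores for five examinees and assumes a simple linear reversal of the error count; the actual AT-SAT transformation is not specified in this section.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores (not data from the study).
pts_safety_errors = [2, 5, 1, 7, 4]   # PTS scoring: higher = worse
atsat_raw_errors = [3, 6, 1, 8, 3]    # raw error counts on the later version

# Assumed reversal so that higher = better, as in the AT-SAT scoring direction.
atsat_safety = [max(atsat_raw_errors) - e for e in atsat_raw_errors]

r_raw = pearson(pts_safety_errors, atsat_raw_errors)
r_scored = pearson(pts_safety_errors, atsat_safety)
print(round(r_raw, 2), round(r_scored, 2))
```

Any linear reversal of one variable leaves the size of the correlation unchanged and flips its sign, so two versions of the test that agree closely would show a strong negative correlation between a PTS score and its AT-SAT counterpart.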

Table  6.7  shows  the  relationship  of  the  scores  from 
the  version  of  the  Air  Traffic  Scenarios  Test  included  in 
the  Pre-Training  Screen  with  the  version  of  the  Air 
Traffic  Scenarios  Test  included  in  AT-SAT.  As  ex¬ 
pected,  the  PTS  Safety  Score  is  more  highly  correlated 
with  the  AT-SAT  Safety  Score  than  with  the  AT-SAT 
Efficiency  Score  (and  those  correlations  are  negative). 
Also,  the  correlation  between  the  PTS  Average  Total 
Delay  Time  and  AT-SAT  Efficiency  Score  was  both 
significant  and  negative.  The  Procedural  Accuracy  score 
from  the  AT-SAT  version  had  little  relationship  with 
either  PTS  ATST  score. 

Archival  Data  and  the  Experience  Questionnaire. 
The  merging  of  the  archival  data  with  the  AT-SAT 
concurrent  validation  data  provided  an  opportunity  to 
investigate  the  construct  validity  of  the  personality  test 
contained  in  the  AT-SAT  battery.  Construct  validity  of 
the  Experience  Questionnaire  (EQ)  was  investigated 
using  the  following  methods:  principal  component  analy¬ 
sis  to  determine  structure  of  the  scale;  and  Pearson 
product-moment  correlations  to  determine  the  degree  of 
convergence  and  divergence  with  archival  16PF  data. 
The  167  items  contained  in  the  EQ  were  used  to 
calculate  14  personality  scales,  which  were  used  in  the 
analyses. 

In  terms  of  the  principal  components  analysis,  a  final 
solution  revealing  at  least  two  independent  factors  would 
provide  evidence  that  the  EQ  scales  measure  unique 
constructs. Relationships between some of the EQ scales
would be anticipated; therefore, certain scales should
load on the same factor. However, some scales should be
unrelated,  meaning  that  they  should  load  on  different 
factors.  For  example,  “taking  charge”  and  “decisiveness” 
are  likely  to  be  related  and  therefore  load  together  on  a 
factor.  The  variable  “concentration”,  on  the  other  hand, 
would  not  be  expected  to  have  a  high  degree  of  relation¬ 
ship  with  these  other  two  variables  and  should  load  on  a 
different  factor.  An  oblique  principal  components  analy¬ 
sis  was  conducted  using  data  collected  during  the  AT- 
SAT  concurrent  validation  study.  As  shown  in  Table 
6.8,  the  principal  components  analysis  resulted  in  the 
extraction  of  only  two  factors.  The  first  factor  accounts 
for  56%  of  the  variance,  whereas  the  second  factor 
accounts  for  only  9.49%.  Additionally,  these  two  factors 
are correlated with each other (r = .54). These results
suggest  that  the  variance  in  EQ  scores  is  best  explained 
by  one  primary  factor,  although  a  small  percentage  is 
explained  by  a  related  factor.  For  the  most  part,  the  EQ 
scales  are  related  to  each  other  even  when  they  should 
theoretically  be  distinct.  The  results  of  this  principal 
components  analysis  fail  to  provide  support  for  the 
independence  of  the  EQ  scale  scores. 
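
A dominant first component is what uniformly inter-related scales would produce. The sketch below is an idealized illustration, not a reanalysis: it assumes that all 14 scales correlate equally at .50 (a hypothetical value), and it computes an unrotated first principal component rather than the oblique solution reported in Table 6.8.

```python
import math


def equicorrelation_matrix(k, rho):
    """k x k correlation matrix with every off-diagonal entry equal to rho."""
    return [[1.0 if i == j else rho for j in range(k)] for i in range(k)]


def first_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a positive semidefinite matrix by power iteration."""
    n = len(matrix)
    vec = [1.0 + 0.01 * i for i in range(n)]  # arbitrary non-zero starting vector
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(p * p for p in product))
        vec = [p / value for p in product]
    return value


k, rho = 14, 0.50  # 14 EQ scales; rho is an assumed common inter-scale correlation
share = first_eigenvalue(equicorrelation_matrix(k, rho)) / k
print(round(share, 3))
```

For an equicorrelation matrix the first eigenvalue equals 1 + (k - 1) * rho, so the share of variance carried by the first component rises steadily with the common correlation.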

Further  support  for  the  construct  validity  of  the  EQ 
was  sought  by  comparing  scale  scores  to  archival  16PF 
scores. Although the 16PF is not necessarily the standard
by which all personality tests are measured, it is, in fact,
an established measure of personality traits that is widely
used in clinical and experimental settings. The merging
of these two databases resulted in 451 usable cases. A
description of the 16PF scales included in the analyses is
provided in Table 6.9. Certain relationships would be
expected to exist between scores from the two tests.
Specifically, there would be support for the construct
validity of the EQ scales if they correlate with 16PF scales
that  measure  a  similar  construct.  Conversely,  the  EQ 
scales  would  be  expected  to  be  unrelated  to  16PF  scales 
that  measure  other  constructs.  Since  the  16PF  was 
administered  several  years  before  the  EQ,  these  expected 
relationships  are  based  on  the  assumption  that  measure¬ 
ment  of  personality  characteristics  remains  relatively 
stable  over  time.  This  assumption  is  supported  by  Hogan 
(1996)  and  Costa  &  McCrae  (1988).  A  summary  of  the 
expected relationships between EQ and 16PF scale scores
is  provided  below. 

The EQ Composure scale should be positively correlated
with 16PF Factor C (emotionally stable), which
would indicate that people high in composure are also
more emotionally stable and calm. EQ Task Closure and
EQ Consistency of work behavior should be positively
correlated with 16PF Factor G (conscientiousness). EQ
Working  Cooperatively  should  be  positively  correlated 
with  16PF  Factors  A  (outgoing  and  participating)  and 
Q3  (socially  precise)  as  well  as  negatively  correlated  with 
Factor  L  and  Factor  N  (which  would  indicate  that  these 
people are trusting and genuine). Furthermore, it would
be expected that a high score on EQ Decisiveness and
EQ Execution would be negatively correlated with 16PF
Factor O, meaning that decisive people would also be
expected to be self-assured and secure. EQ Flexibility
should have a positive correlation with 16PF Factor A
and a negative correlation with Factor Q4 (relaxed).

The  EQ  Tolerance  for  High  Intensity  scale  would  be 
expected to be positively correlated with 16PF Factor H
(Adventurous)  and  negatively  correlated  with  Factor  O 
(Apprehensive).  EQ  Self-Awareness  and  EQ  Self-Confi¬ 
dence  should  both  be  negatively  correlated  with  16PF 
Factor  O  (Apprehensive).  A  positive  correlation  between 
EQ  Self-Confidence  and  16PF  Factor  Q2  (Self-suffi¬ 
cient)  might  also  be  expected.  EQ  Sustained  Attention 
and  EQ  Concentration  should  be  related  to  16PF  Factor 
G (conscientiousness), whereas EQ Taking Charge should
be related to 16PF Factor H (Adventurous) and Factor
E (Assertive). Finally, EQ Interpersonal Tolerance should
be  positively  correlated  with  16PF  Factor  I  (Tender- 
minded),  Factor  Q3  (socially  precise),  and  Factor  C 
(Emotionally  Stable). 

Scores  on  the  EQ  and  16PF  scales  were  compared 
using  Pearson  product-moment  correlations,  the  results 
of which are presented in Table 6.10. The results of
correlational analyses between the EQ scales show that
they are all inter-related. However, this is not surprising
considering the results of the principal components
analysis described above. Although relationships between
some of the scales contained in a personality
measure are not unusual, moderate to high correlations
between all of the scales are another matter.

As stated earlier, the EQ scores were compared to
16PF Factor scores in an effort to support construct
validity by determining whether or not these scales
measure what they are purported to measure. Although
statistically significant, the correlations between EQ and
16PF scales represent small effect sizes and are not of the
magnitude desired when attempting to support the
validity of a test. The statistical significance of these
relationships is most likely an artifact of sample size. For
the most part, the pattern of relationships with 16PF
scales was the same for all EQ scales. This would not be
expected if the EQ scales did in fact measure different
constructs. This pattern is not unexpected given the EQ
inter-scale correlations and the results of the principal
components analysis. The results of these analyses fail to
provide evidence that the EQ scales measure unique
constructs, let alone the specific constructs they are
professed to measure. However, there are indications
that the EQ contributes to the prediction of AT-SAT
criterion measures (Houston & Schneider, 1997).
Consequently, CAMI will continue to investigate the
construct validity of the EQ by comparing it to other
personality measures such as the NEO PI-R.

Regression  of  Archival  Selection  Procedures  and 
AT-SAT  Tests  on  AT-SAT  Criteria.  A  multiple  linear 
regression  analysis  was  conducted  to  assess  the  contribu¬ 
tion  of  the  AT-SAT  tests  in  predicting  the  AT-SAT 
criterion,  over  and  above  the  contribution  of  the  OPM 
rating  and  final  score  from  the  Nonradar  Screen  pro¬ 
gram.  The  regression  analysis  used  OPM  rating,  final 
score  in  the  Nonradar  Screen  program,  and  AT-SAT  test 
scores  as  predictors,  and  the  weighted  composite  of  AT- 
SAT  criterion  measures  as  the  criterion  variable.  To 
compute  the  weighted  composite  criterion  measure,  the 
CBPM  received  a  .6  weighting  while  the  AT-SAT  rating 
composite  received  a  .4  weighting.  A  stepwise  regression 
was  used. 
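
The weighting described above can be sketched in a few lines. This is only an illustration, not the project's actual scoring code: it assumes both components are standardized (z-scored) before the .6 and .4 weights are applied, which the text does not state, and the function and variable names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw scores to z-scores (assumed step; not stated in the report)."""
    m = mean(scores)
    s = pstdev(scores)
    return [(x - m) / s for x in scores]


def composite_criterion(cbpm, ratings, w_cbpm=0.6, w_ratings=0.4):
    """Weighted composite: .6 for the CBPM and .4 for the rating composite."""
    z_cbpm = standardize(cbpm)
    z_ratings = standardize(ratings)
    return [w_cbpm * a + w_ratings * b for a, b in zip(z_cbpm, z_ratings)]


# Illustrative (made-up) scores for three controllers.
example = composite_criterion([80, 85, 90], [4, 5, 6])
```

Standardizing first keeps the two components on a common scale, so the weights reflect their intended relative contribution rather than differences in raw-score variance.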

Table 6.11 shows the results of the analysis. A model
containing the following variables produced a multiple
correlation coefficient (R) of .465, accounting for about
22% of the variance in the AT-SAT composite criterion
variable: Analogies Reasoning score, final score from the
Nonradar Screen program, Applied Math Number Correct,
Scan Total score, EQ Unlikely Virtues scale, and Air
Traffic Scenarios Procedural Accuracy score. It is
interesting that the final score in the
Nonradar Screen program contributed so much to the
prediction of the criterion measure, because there was
considerable restriction in the range of that variable. At
least 40% of the candidates failed the Nonradar Screen
program and were removed from employment, and
another 10% failed field training and were also removed
from employment or reassigned to another type of air
traffic facility.
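
As a quick arithmetic check, the "about 22%" figure follows from squaring the coefficient of .465 reported above (only that value is taken from the text):

```latex
R^2 = (0.465)^2 \approx 0.216 \quad \text{(about 22\% of the criterion variance)}
```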

It may appear surprising that more of the AT-SAT
predictor tests were not included in this model, but they
probably accounted for parts of the variance in
the AT-SAT composite criterion measure that were also
accounted for by the final score in the Nonradar Screen
program. For example, the Safety and Efficiency scores
from the Air Traffic Scenarios Test; Applied Math; Angles;
the Letter Factories Number of Letters Correctly Placed,
Planning & Thinking Ahead, and Situation Awareness
scores; and the EQ Composure and Self-Confidence scales all had
significant correlations with the final score in the
Nonradar Screen program. On the other hand, the
Unlikely Virtues scale from the EQ probably tapped a
part of the variance in the AT-SAT composite criterion
measure that was not already tapped by another AT-SAT
predictor test or by the final score in the Nonradar Screen
program. The Unlikely Virtues scale will not be included
as part of the selection battery, but will be retained to
provide information about whether the applicant is
faking responses on the rest of the EQ scales.

Discussion 

Several  analyses  were  conducted  to  examine  interrela¬ 
tionships  between  archival  selection  tests,  archival  crite¬ 
rion  measures,  and  experimental  tests  administered  to 
candidates  entering  the  Academy  for  the  Nonradar  Screen 
program.  The  purpose  of  these  analyses  was  to  assess  the 
construct  validity  of  the  AT-SAT  criterion  measures  and 
predictors.  The  results  of  the  analyses  supported  the 
interpretation  of  the  AT-SAT  measures  discussed  in 
other  chapters  of  this  report. 

For example, the times required to complete
various phases of field training, which were used as
archival criterion measures, were related to the AT-SAT
rating composite. Also, the OPM rating, the final score
in  the  Nonradar  Screen  program,  and  the  final  score  in 
the  Radar  Training  program,  were  all  significantly  cor¬ 
related  with  the  CBPM.  The  final  score  in  the  Nonradar 
Screen  program  and  the  final  score  in  the  Radar  Training 
program  were  both  significantly  correlated  with  the  AT- 
SAT  rating  composite.  Also,  the  component  tests  of  the 
OPM  Battery,  the  Nonradar  Screen  program,  and  the 
Radar  Training  program  all  had  significant  correlations 
with  the  CBPM.  Furthermore,  all  scales  from  the  Over- 
the-shoulder  rating  form  used  in  the  high-fidelity  simu¬ 
lation  (which  were  significantly  correlated  with  both  the 
CBPM  and  the  AT-SAT  rating  composite)  were  also 
significantly  correlated  with  both  the  Instructor  Assess¬ 
ment  and  Technical  Assessment  ratings  made  during 
both  the  Nonradar  Screen  program  and  the  Radar  Train¬ 
ing  program.  These  results  suggest  that  the  CBPM  and 
the  composite  ratings  are  related  to  measures  used  in 
the  past  as  criterion  measures  of  performance  in  air 
traffic  control. 

Additional analyses suggest that the AT-SAT predictors
are also related to tests previously used to select air
traffic controllers. The MCAT was correlated with many
of the AT-SAT predictor tests, especially those involving
dynamic activities. The Abstract Reasoning test
had a particularly high correlation with the Analogies
Reasoning score, but was also correlated with other
AT-SAT predictors.

Other  tests,  administered  experimentally  to  air  traffic 
control  candidates  between  the  years  of  1981  and  1995, 
provided  additional  support  for  the  construct  validity  of 
AT-SAT  predictor  tests.  For  example,  the  Math  Apti¬ 
tude  Test  from  the  ETS  Factor  Reference  Battery 
(Ekstrom  et  al.,  1976),  the  Dial  Reading  Test,  and  a 
biographical  item  reporting  high  school  math  grades 
(which  was  previously  shown  to  be  correlated  with 
success  in  the  Nonradar  Screen  program)  had  high 
correlations  with  the  Applied  Math  Test.  The  Angles 
and  Dials  tests  were  also  correlated  with  Dial  Reading, 
Math  Aptitude,  and  the  biographical  item  reporting 
high  school  math  grades.  These  results  are  not  surprising, 
considering  the  consistent  relationship,  observed  over 
years  of  research,  between  aptitude  for  mathematics  and 
various  measures  of  performance  in  air  traffic  control. 

Finally,  a  multiple  linear  regression  analysis  was  con¬ 
ducted  which  showed  that  several  of  the  AT-SAT  tests 
contributed  to  the  prediction  of  the  variance  in  the  AT- 
SAT  composite  criterion  measure  over  and  above  the 
OPM  rating  and  the  final  score  in  the  Nonradar  Screen 
program.  The  OPM  battery  and  Nonradar  Screen  pro¬ 
gram  provided  an  effective,  though  expensive,  two-stage 
process  for  selecting  air  traffic  controllers  that  was  used 
successfully for many years. It appears that the AT-SAT
battery has predictive validity equivalent to, or better than,
that of the former selection procedure, and costs much less
to administer. Thus, it should be an improvement over
the previous selection process.

To  maintain  the  advantage  gained  by  using  this  new 
selection  procedure,  it  will  be  necessary  to  monitor  its 
effectiveness  and  validity  over  time.  This  will  require 
developing  parallel  forms  of  the  AT-SAT  tests,  conduct¬ 
ing  predictive  validity  studies,  developing  and  validating 
new  tests  against  criterion  measures  of  ATC  performance, 
and  replacing  old  tests  with  new  ones  if  the  former  become 
compromised  or  prove  invalid  for  any  reason. 
1. True north and magnetic north are the same.

2. There is an airport co-located with each displayed Navaid except CEN. There are three primary airports:


Uptown:  UPT 

Downtown:  DWN 

Hubsville:  HUB 

FSS  only. 

VFR  tower. 

Hubsville  approach  owns  10,000 
and  below. 

IFR  approach  is  VOR  for 
RWY  27. 

IFR  approach  is  ILS  to  RWY  18. 

IAF  is  DOWNY. 

STAR:  north,  northwest,  &  west. 

Jet arrivals via CENTR1 cross at
11,000 @ 250 knots, propellers
cross at 10,000; HUB’s control
on contact.

Departures via V4/J4 climb to
10,000; your control on contact.

Missed approach altitude is
3500.

Missed approach altitude is
2000.

3.  “DPT”  indicates  a  departure  from  outside  depicted  airspace;  “DESTN”  indicates  an  arrival  at  an 
airport  outside  depicted  airspace. 

4.  Tick  marks  on  CENTR1  arrival  are  10  miles  apart,  and  airways  start  5  miles  from  the  Navaids. 

5.  Each  full  data  block  has  a  one  minute  velocity  vector  and  three  histories. 


Figure  4.2.  Airspace  Summary:  Sector  05  in  Hub  Center 


Figure  4.3.  Example  CBPM  Item 
Volunteers Needed to Take Air Traffic Controller Tests

Interested  in  air  traffic  controller  jobs?  We  need  volunteers  to  take  some  computer  administered  tests 
that  are  being  evaluated  for  use  in  selecting  future  controllers.  Volunteering  provides  a  preview  of  potential 
tests  for  future  controllers  and  in  no  way  affects  future  employment  as  a  controller.  Requires  8  hours,  including 
breaks,  and  a  meal  which  is  provided.  Tests  administered  in  June/July  1997.  Minimum  qualifications  for  taking 
tests  are:  US  citizenship,  ages  between  17  and  30,  AND  at  least  3  years  of  general  work  experience.  Please  call 
toll-free  1-888-322-2827 

Figure 5.2.1. Sample Classified Newspaper Advertisement for Soliciting Civilian Pseudo-Applicants


Interested In a 1-Day Research Experience Involving

Pre-Employment  Testing  for  Air  Traffic  Controllers? 

Air  traffic  controllers  provide  for  the  safe,  orderly,  and  expeditious  passage  of  airplanes  from  location  to 
location  along  established  airways.  In  doing  so,  air  traffic  controllers  use  sophisticated,  hi-tech  radar  and 
communications  systems  to  coordinate  with  pilots  and  other  air  traffic  controllers.  A  consortium,  under  contract 
to  the  Federal  Aviation  Administration  of  the  United  States  Department  of  Transportation,  is  currently 
evaluating  new  tests  that  are  being  considered  for  possible  use  in  the  coming  years  to  select  entry-level,  or  new, 
air  traffic  controllers.  Therefore,  those  interested  in  this  job  field  are  being  asked  to  volunteer  some  time  to  take 
the  computer-administered  tests. 

Minimum  Qualifications? 

United  States  citizenship,  AND  age  between  17  and  30 
(proof  is  required  at  time  of  testing),  AND  3  years  work 
experience  in  any  job  or  job  type.  College  may  be 
substituted for work at the rate of 1 year of college for
9 months work experience.

When? 

Testing  will  occur  in  June/July  1997.  Please  call  toll-free  1-888-322-2827 
for  more  detailed  information. 

Where? 

Air  Route  Traffic  Control  Center,  (street  address), 

(city),  (state),  (zip  code). 

Time  &  Date? 

Please  call  toll-free  1-888-322-2827  to  schedule  a  time 
and  date  for  testing. 

How  Long  Will  It  Take? 

Approximately  8  hours 

What  Do  I  Bring? 

Valid  form  of  photo  identification,  such  as  a 
driver’s  license  or  passport 

Figure  5.2.2.  Sample  Flyer  Advertisement  for  Soliciting  Civilian  Pseudo-Applicants. 
Raw Examinee Test Data stored in weekly DP directories
    23,107 ASCII files stored in 22 subdirectories

Edited Criterion Assessment Rating Sheets
    DSN = CAR

-> Final Analytic Summary Dataset
    Examinee Level Data containing:
    •  scored test data extracts
    •  complete historical data
    •  complete biographical information
    •  Keesler ASVAB data
    •  rater ID numbers
    DSN = XFINDAT4

Edited Rater Biodata Forms
    containing biographical data for assessors only & participant raters
    DSN = XBRATER

Examinee Item Level Scored Test Data
    DSNs =
     1) AM_ITEMS      8) LA_ITEMS
     2) AN_ITEMS      9) ME_ITEMS
     3) AT_ITEMS     10) MR_ITEMS
     4) AY_ITEMS     11) PL_ITEMS
     5) CR_ITEMS     12) SC_ITEMS
     6) DI_ITEMS     13) SN_ITEMS
     7) EQ_ITEMS     14) TW_ITEMS

Scaled, Imputed, & Standardized Test Scores
    DSN = PREDSALL

Examinee Item Level Scored Test Data
    DSNs =
     1) AM_ITEMS      9) LB_ITEMS
     2) AN_ITEMS     10) ME_ITEMS
     3) AS_ITEMS     11) MR_ITEMS
     4) AT_ITEMS     12) PL_ITEMS
     5) AY_ITEMS     13) SC_ITEMS
     6) CL_ITEMS     14) SN_ITEMS
     7) DI_ITEMS     15) TW_ITEMS
     8) LA_ITEMS

-> Final Summary Data
    DSN = SUMMARY

-> = feeder datasets contributing selected variables/observations.

Dataset names listed here contain only the prefix portion. Except for the raw
files, the suffix portion of the name is "POR", indicating portable SPSS files.

Data files completely contained within another dataset are not listed separately.


Figure 5.3.1. AT-SAT Data Base (*)
Air Traffic Selection & Training (AT-SAT) Database

    Readme.txt

    Alpha Data
        Examinee Item Level Scored Test Data
             1) AM_ITEMS.POR      9) LB_ITEMS.POR
             2) AN_ITEMS.POR     10) ME_ITEMS.POR
             3) AS_ITEMS.POR     11) MR_ITEMS.POR
             4) AT_ITEMS.POR     12) PL_ITEMS.POR
             5) AY_ITEMS.POR     13) SC_ITEMS.POR
             6) CL_ITEMS.POR     14) SN_ITEMS.POR
             7) DI_ITEMS.POR     15) TW_ITEMS.POR
             8) LA_ITEMS.POR
        Final Summary Data
             1) SUMMARY.POR

    Beta Data
        Edited Criterion Assessment Rating Sheets
             1) CAR.POR
        Edited Rater Biodata Forms
             1) XBRATER.POR
        Examinee Item Level Scored Test Data
             1) AM_ITEMS.POR      8) LA_ITEMS.POR
             2) AN_ITEMS.POR      9) ME_ITEMS.POR
             3) AT_ITEMS.POR     10) MR_ITEMS.POR
             4) AY_ITEMS.POR     11) PL_ITEMS.POR
             5) CR_ITEMS.POR     12) SC_ITEMS.POR
             6) DI_ITEMS.POR     13) SN_ITEMS.POR
             7) EQ_ITEMS.POR     14) TW_ITEMS.POR
        Final Analytic Summary Data
             1) CBK1.DOC
             2) CBK2.DOC
             3) XFINDAT5.POR
        Raw Examinee Test Data in Weekly Subdirectories
             (22 subdirectories containing 23,107 data files)
        Scaled, Imputed, & Standardized Test Scores
             1) PREDSALL.POR


Figure 5.3.2. CD-ROM Directory Structure of AT-SAT Data Base
Axis labels: % rank; Predictor Test Score (50 to 100)


Validity  is  the  slope  of  the  line  showing  the  increase  in  average 
performance  associated  with  an  increase  in  test  scores. 

AT-SAT  has  a  much  higher  validity  than  the  old  OPM  test. 

Above  the  cut  scores,  AT-SAT’s  line  is  higher  than  OPM.  This 
means  that  the  selected  workforce  will  perform  better  when  AT-SAT 
is  used  to  screen  applicants. 


Figure  5.5.1.  Expected  Performance:  OPM  vs.  AT-SAT 
Axis label: Selection Method


Figure 5.5.2. Percentage of Selected Applicants whose Expected Performance is in the Top Third of Current Controllers: OPM vs. AT-SAT


Figure 5.6.1. Fairness Regression for Blacks Using AT-SAT Battery Score and Composite Criterion

Axis label: AT-SAT Battery Score


Figure 5.6.2. Fairness Regression for Hispanics Using AT-SAT Battery Score and Composite Criterion

Figure 5.6.3. Fairness Regression for Females Using AT-SAT Battery Score and Composite Criterion
Vertical scale: 0.00 to 0.10. Groups shown: All, Blacks, Whites, Hispanics.
Slope values legible in the source: 0.049 and 0.055.

Plot annotations: "Best estimate of slope"; "We are 95% sure that the true
slope is somewhere within this range in the population."


Notes. Controller sample used. The dependent variable is the composite criterion. The independent
variable is the composite predictor (scaled such that the cut-score is 70 and the maximum possible
score is 100). Each slope value represents the predicted increase in the criterion score for a unit increase
in the predictor score. For the criterion, M = -.05, SD = .83 for the entire controller sample. For the
predictor, M = 72.4, SD = 7.9 for the entire controller sample.

Figure  5.6.4.  Confidence  Intervals  for  the  Slopes  in  the  Fairness  Regressions 
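
As a reading aid for the raw slopes in Figure 5.6.4, the following minimal sketch converts a raw regression slope into standard-deviation units using the whole-sample SDs given in the notes (7.9 for the predictor, .83 for the criterion). The function names are hypothetical, and the slope of 0.049 is used only as an illustrative input, since the group it belongs to cannot be recovered from the figure here.

```python
def standardized_slope(raw_slope, sd_predictor, sd_criterion):
    """Rescale a raw slope to SD units (criterion SDs per predictor SD).

    In a simple one-predictor regression this equals the correlation
    between predictor and criterion.
    """
    return raw_slope * sd_predictor / sd_criterion


def predicted_gain(raw_slope, predictor_points):
    """Predicted change in the criterion for a given change in predictor score."""
    return raw_slope * predictor_points


# Illustrative values: slope 0.049, SDs taken from the figure notes.
beta = standardized_slope(0.049, sd_predictor=7.9, sd_criterion=0.83)
gain_10_points = predicted_gain(0.049, 10)
```

Under these assumptions, a slope of 0.049 corresponds to roughly 0.47 criterion SDs per predictor SD, and a 10-point difference in battery score to a predicted difference of about 0.49 criterion units.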
Plot title: Expected Score Frequency by Applicant Group. Legend: All, Females, Hispanic, Black.

=> Best estimates show mean differences, but also a great deal of overlap


Figure 5.6.5. Expected Score Frequency by Applicant Group

Axis label: Group and Recruiting Strategy


Figure 5.6.6. Percent Passing by Recruitment Strategy

FIGURES  AND  TABLES 


Table 4.1. CBPM Development and Scaling Participants: Biographical Information

                                      CBPM Scenario/   Initial Scaling   Final Scaling
                                      Item Authors     Participants      Participants
Total number of participants                3                9                12
Gender (frequency):
  Male                                      3                8                11
  Female                                    0                1                 1
Race (frequency):
  Black/African American                    0                1                 1
  Native American/American Indian           0                2                 1
  Hispanic                                  0                2                 1
  White/Caucasian                           3                4                 8
  Other                                     0                0                 1
Job title (frequency):
  FAA Academy Instructor                    3                4                 3
  Supervisor                                0                4                 5
  Controller                                0                1                 4
Mean age^a                            33.67 (2.08)     41.33 (6.22)     40.58 (7.66)
Mean time as an FPL                    5.25 (1.37)     10.94 (4.20)      8.79 (5.09)
Mean time as a supervisor - Years       .08 (.14)       2.30 (4.09)      3.51 (6.34)
Mean time as an instructor - Years     6.47 (1.91)      5.00 (3.58)      6.62 (5.52)

^a Standard deviations appear in parentheses.
Table 4.2. CBPM Scaling Workshops: Interrater Reliability Results^a

                         Initial Scaling   Initial Scaling   Final Scaling    Scoring Key for
Number of Items          Group 1^b         Group 2^c         Participants^d   84-Item^e CBPM
Reliability < .10               9                 7                -                -
Between .10 and .19             1                 4                -                -
Between .20 and .29             3                 1                -                -
Between .30 and .39             4                 2                -                -
Between .40 and .49             4                 8                5                1
Between .50 and .59             1                 3                1                -
Between .60 and .69             8                 7                1                -
Between .70 and .79            10                12                7                1
Between .80 and .89            18                26               22               13
Reliability > .90              41                29               58               46
Total Number of Items          99                99               94               61

^a Reliabilities are k-rater intraclass correlation coefficients; these coefficients reflect the reliability of the mean ratings.
^b N = 4
^c N = 5
^d N = 12
^e 61 items required the panel to rate the effectiveness of each response option; 23 items were either knowledge or
“confliction prediction” items that had a correct answer.
Table 4.3. Performance Categories for Behavior Summary Scales


A.  Coordinating 

Coordinating  with  other  controllers  to  minimize  traffic  problems;  coordinating  clearances,  changes  in 
aircraft  destinations,  altitudes,  etc.  as  appropriate;  initiating  and  receiving  handoffs  and  pointouts  in  an 
effective  manner;  presenting  the  rationale  for  instructions  to  pilots  or  other  controllers  as  necessary. 

B.  Communicating  and  Informing 

Using  clear,  concise,  accurate  language  to  get  message  across  unambiguously;  talking  only  when 
necessary  and  appropriate;  employing  proper  phraseology  to  ensure  accurate  communications;  notifying 
pilots/controllers/other  personnel  of  information  that  might  affect  them  as  appropriate;  issuing  advisories 
and  alerts  to  appropriate  parties;  providing  complete  and  accurate  position  relief  briefings;  providing 
accurate  and  legible  flight  strip  information;  listening  carefully  to  requests  and  instructions  (e.g.,  from 
pilots,  other  controllers)  and  ensuring  that  they  are  understood;  attending  to  readbacks  and  ensuring  that 
they  are  accurate. 

C.  Maintaining  Attention  and  Vigilance 

Scanning  properly  for  air  traffic  events,  situations,  potential  problems,  etc.;  keeping  track  of 
equipment/weather  status;  identifying  unusual  events,  improper  positioning  of  aircraft,  equipment 
malfunctions,  etc.;  recognizing  when  aircraft  have  potential  for  loss  of  separation;  verifying  visually  that 
control  instructions  are  followed;  recognizing  potential  problems  in  adjacent  sectors;  remaining  vigilant 
during  slow  periods. 

D.  Managing  Multiple  Tasks 

Keeping  track  of  a  large  number  of  aircraft/events  at  a  time;  conducting  two  or  more  tasks  simultaneously 
(e.g.,  issuing  instructions  while  scanning  the  screen;  monitoring  pilot  communications  while  writing  on 
strips);  remembering  and  keeping  track  of  aircraft  and  their  positions;  remembering  what  you  were  doing 
after  an  interruption;  returning  to  what  you  were  doing  after  an  interruption  and  following  through; 
providing  pilots  with  additional  services  as  time  allows. 

E.  Prioritizing 

Taking  early  or  prompt  action  on  air  traffic  problems  rather  than  waiting  or  getting  behind;  knowing  what 
to  do  first  and  which  are  the  most  important  situations  to  work  on;  recognizing  that  some  problems  or 
situations  are  less  important  and  can  wait;  preplanning  before  busy  periods;  organizing  the  board  and 
using  flight  strips  effectively  to  keep  priorities  straight  for  handling  air  traffic  situations;  quickly  and 
decisively  determining  appropriate  priorities. 

F.  Technical  Knowledge 

Knowing  the  equipment  and  its  capabilities  and  using  it  effectively;  knowing  aircraft  capabilities/limitations 
(speed,  wake  requirements,  size,  minimums)  and  using  that  knowledge;  keeping  up-to-date  on  letters  of 
agreement,  changes  in  procedures,  regulations,  etc.;  keeping  up-to-date  on  seldom  used  procedures  or  skills. 


G.  Maintaining  Safe  and  Efficient  Air  Traffic  Flow 

Reacting  to  and  resolving  potential  conflictions  effectively  and  efficiently;  using  proper  air  traffic 
separation  techniques  effectively  to  ensure  safety;  sequencing  aircraft  effectively  for  arrival  or  departure; 
sequencing  aircraft  to  ensure  efficient/timely  traffic  flow;  controlling  traffic  in  a  manner  that  ensures 
efficient  traffic  flow;  controlling  traffic  in  a  manner  that  minimizes  traffic  problems  (e.g.,  conflictions, 
traffic  flow  problems)  for  other  controllers  and  pilots. 

H.  Reacting  to  Stress 

Remaining  calm  and  cool  under  stressful  situations;  handling  stressful  air  traffic  conditions  in  a 
professional  manner. 

I.  Teamwork 

Working  smoothly  with  supervisors  and  other  controllers  in  the  facility;  pitching  in  and  helping  other 
controllers  as  necessary;  accepting  and  reacting  constructively  to  appropriate  criticism  from  supervisors 
or  peers;  avoiding  arguments  and  interpersonal  conflicts  with  other  controllers,  supervisors,  or  pilots. 

J.  Adaptability/Flexibility 

Reacting  effectively  to  difficult  equipment  problems,  changes  in  weather,  traffic  situations,  etc.,  or  to 
unexpected  actions  on  the  part  of  other  controllers  or  pilots;  using  contingency  or  "fall-back"  strategies 
effectively  when  unforeseen/unanticipated  air  traffic  problems  emerge  or  if  first  plan  doesn’t  work;  asking 
for  help  when  it’s  needed;  developing/executing  innovative  solutions  to  air  traffic  problems;  dealing 
effectively  with  situations  for  which  there  may  not  be  clearly  prescribed  procedures,  situations  that 
require  novel  thinking;  adapting  to  equipment  updates,  new  kinds  of  procedures,  etc. 


Table 4.4. Pilot Test Results: Computer-Based Performance Measure (CBPM) Distribution of Scores

Percentage of Maximum Score    Number of Controllers
69% or lower                            1
70 - 74%                                1
75 - 79%                                9
80 - 84%                               36
85 - 89%                               28
90% or higher                           2

Note. N = 77; Mean Score (i.e., percentage) = 84.4; Standard Deviation = 4.0; Coefficient Alpha Reliability = .53.
Table 4.5. Pilot Test Results: Means and Standard Deviations for Ratings on Each Dimension

                                                                             Supervisors &
                                                   Supervisors^a   Peers^b      Peers^c
Rating Dimension                                   Mean    SD    Mean    SD    Mean    SD
 1. Maintaining Safe & Efficient Air Traffic Flow  5.27   1.11   5.40   1.09   5.32   1.01
 2. Maintaining Attention & Vigilance              4.89   1.10   5.25    .95   5.04    .97
 3. Prioritizing                                   5.25   1.06   5.29    .99   5.28    .93
 4. Communicating & Informing                      5.11   1.18   5.35   1.00   5.25   1.03
 5. Coordinating                                   5.23   1.06   5.72    .72   5.46    .86
 6. Managing Multiple Tasks                        5.23    .98   5.17   1.14   5.21    .91
 7. Reacting to Stress                             4.85   1.33   5.18   1.21   4.92   1.24
 8. Adaptability & Flexibility                     4.99   1.21   5.33    .95   5.12   1.07
 9. Technical Knowledge                            5.42   1.14   5.42   1.11   5.42    .99
10. Teamwork                                       5.21   1.32   5.52   1.06   5.33   1.10
11. Overall Effectiveness                          5.33    .98   5.47    .88   5.38    .88

^a N = 64
^b N = 49
^c N = 72


Table 4.6. Pilot Test Results: Interrater Reliabilities for Ratings^a

                                                                                   Combined
                                                   Supervisor       Peer           Supervisor/Peer
Rating Dimension                                   Reliabilities^b  Reliabilities^c  Reliabilities^d
 1. Maintaining Safe & Efficient Air Traffic Flow       .51              .55              .57
 2. Maintaining Attention & Vigilance                   .60              .37              .54
 3. Prioritizing                                        .63              .49              .55
 4. Communicating & Informing                           .51              .00              .49
 5. Coordinating                                        .50              .00              .37
 6. Managing Multiple Tasks                             .31              .43              .41
 7. Reacting to Stress                                  .47              .28              .61
 8. Adaptability & Flexibility                          .65              .43              .58
 9. Technical Knowledge                                 .48              .51              .53
10. Teamwork                                            .45              .59              .47
11. Overall Effectiveness                               .57              .57              .62

^a Reliabilities are k-rater intraclass correlation coefficients; these coefficients reflect the reliability of the mean
ratings.
^b N = 64, k = 1.24
^c N = 49, k = 1.30
^d N = 72, k = 1.84
Table 4.8. Rater-Ratee Assignments

Number of Supervisor           Number of Peer              Total Number of
Raters / Ratee          N      Raters / Ratee       N      Raters / Ratee       N
       0               33             0            74             1            40
       1               92             1            87             2            79
       2             1064             2          1044             3            93
       3               34             3            21             4           974
       4                4             4             1             5            39
                                                                  6             2

Mean number of supervisor raters per ratee = 1.90
Mean number of peer raters per ratee = 1.82
Mean total number of raters per ratee = 3.73


Table 4.9. Computer-Based Performance Measure (CBPM): Distribution of Scores in Validation Sample

Percentage of Maximum Score    Number of Controllers
69% or lower                            5
70 - 74%                               35
75 - 79%                              214
80 - 84%                              490
85 - 89%                              280
90% or higher                          22

Note. N = 1046; Mean Score (i.e., percentage) = 82.68; Standard Deviation = 4.17; Coefficient Alpha Reliability = .63.


Table 4.10. Number and Percentage of Supervisor Ratings at Each Scale Point in the Validation Sample

Rating Scale Point
(1 = Lowest, 7 = Highest)    Number of Ratings^a    Percentage of Ratings
        1                           130                     .51
        2                           646                    2.51
        3                          2336                    9.08
        4                          5605                   21.79
        5                          7569                   29.43
        6                          6727                   26.16
        7                          2683                   10.43
     Missing                         22                     .09

^a Total number of supervisor ratings across all 10 dimensions and the overall performance dimension.
Table 4.11. Number and Percentage of Peer Ratings at Each Scale Point in the Validation Sample

Rating Scale Point
(1 = Lowest, 7 = Highest)    Number of Ratings^a    Percentage of Ratings
        1                            54                     .22
        2                           391                    1.58
        3                          1587                    6.44
        4                          4407                   17.87
        5                          7452                   30.22
        6                          7505                   30.43
        7                          3263                   13.23
     Missing                          3                     .01

^a Total number of peer ratings across all 10 dimensions and the overall performance dimension.


Table 4.12. Interrater Reliabilities for Peer, Supervisor and Combined Ratings^a

                                                                                        Combined
                                                    Supervisor        Peer              Supervisor/Peer
Rating Dimension                                    Reliabilities^b   Reliabilities^c   Reliabilities^d
 1. Maintaining Safe & Efficient Air Traffic Flow        .60               .57               .69
 2. Maintaining Attention & Vigilance                    .51               .49               .63
 3. Prioritizing                                         .50               .46               .60
 4. Communicating & Informing                            .45               .43               .57
 5. Coordinating                                         .43               .32               .50
 6. Managing Multiple Tasks                              .55               .47               .62
 7. Reacting to Stress                                   .54               .53               .65
 8. Adaptability & Flexibility                           .55               .48               .61
 9. Technical Knowledge                                  .48               .44               .60
10. Teamwork                                             .52               .48               .63
11. Overall Effectiveness                                .60               .56               .69

^a Reliabilities are k-rater intraclass correlation coefficients; these coefficients reflect the reliability of the mean ratings.
^b N = 1194, k = 1.88
^c N = 1153, k = 1.87
^d N = 1227, k = 3.39
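
Because Table 4.12 reports reliabilities of the mean of k raters, it can help to see what a given value implies for a single rater. The sketch below uses the standard Spearman-Brown relation for that purpose. It is an illustration under stated assumptions, not the report's own computation: the function names are introduced here, and treating the fractional average k from the footnotes as if it were a fixed number of raters is an approximation.

```python
def single_rater_reliability(r_k, k):
    """Invert Spearman-Brown: reliability of one rater implied by a k-rater mean."""
    return r_k / (k - (k - 1) * r_k)

def k_rater_reliability(r_1, k):
    """Spearman-Brown step-up: reliability of the mean of k raters."""
    return k * r_1 / (1 + (k - 1) * r_1)

# Supervisor value for "Overall Effectiveness" in Table 4.12: .60 with k = 1.88.
r1 = single_rater_reliability(0.60, 1.88)
print(round(r1, 3))                             # implied single-rater reliability
print(round(k_rater_reliability(r1, 1.88), 2))  # stepping back up recovers .60
```

Under this approximation, a k-rater reliability of .60 with k = 1.88 corresponds to a single-rater reliability of roughly .44.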

Table 4.13. Means and Standard Deviations for Mean Ratings on Each Dimension

                                                     Supervisors^a       Peers^b        Supervisors & Peers^c
Rating Dimension                                     Mean     SD       Mean     SD        Mean     SD
 1. Maintaining Safe & Efficient Air Traffic Flow    5.07    1.05      5.31     .97       5.18     .89
 2. Maintaining Attention & Vigilance                4.93    1.02      5.15     .94       5.03     .86
 3. Prioritizing                                     5.03     .97      5.20     .91       5.11     .81
 4. Communicating & Informing                        4.89    1.02      5.12    1.00       5.00     .86
 5. Coordinating                                     5.18     .93      5.46     .83       5.30     .74
 6. Managing Multiple Tasks                          4.98    1.05      5.12     .98       5.05     .87
 7. Reacting to Stress                               4.72    1.19      5.03    1.16       4.88    1.03
 8. Adaptability & Flexibility                       4.81    1.10      5.12     .99       4.96     .89
 9. Technical Knowledge                              5.10     .97      5.22     .94       5.15     .83
10. Teamwork                                         5.00    1.17      5.22    1.11       5.09     .99
11. Overall Effectiveness                            5.02     .95      5.32     .85       5.16     .80

^a N = 1194
^b N = 1153
^c N = 1227

Table 4.15. Factor Analysis Results for Performance Ratings

                                                                  Loadings
Rating Dimension                                     Factor 1     Factor 2     Factor 3
 1. Maintaining Safe & Efficient Air Traffic Flow      .73          .54          .16
 2. Maintaining Attention & Vigilance                  .33          .79          .26
 3. Prioritizing                                       .68          .59          .16
 4. Communicating & Informing                          .30          .65          .50
 5. Coordinating                                       .40          .70          .33
 6. Managing Multiple Tasks                            .82          .44          .09
 7. Reacting to Stress                                 .79          .14          .43
 8. Adaptability & Flexibility                         .82          .30          .33
 9. Technical Knowledge                                .25          .80          .13
10. Teamwork                                           .27          .30          .87

Eigenvalue                                            6.75          .81          .65
% Variance                                            67.5          8.1          6.5

Note. Sample size was 1227. Principal components analysis with varimax rotation was used. Factor names:
1. Technical Performance - Activities directly related to separating aircraft; 2. Technical Effort - Activities in
support of controlling aircraft; 3. Teamwork
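
One way to read the loadings in Table 4.15 is through each dimension's communality, the share of its variance reproduced by the three retained components (the sum of its squared loadings). The sketch below computes that from the transcribed loadings. Communalities are not reported in the table; the calculation and the names used here are illustrative assumptions.

```python
# Rotated loadings transcribed from Table 4.15 (Factor 1, Factor 2, Factor 3).
loadings = {
    "Maintaining Safe & Efficient Air Traffic Flow": (0.73, 0.54, 0.16),
    "Maintaining Attention & Vigilance": (0.33, 0.79, 0.26),
    "Prioritizing": (0.68, 0.59, 0.16),
    "Communicating & Informing": (0.30, 0.65, 0.50),
    "Coordinating": (0.40, 0.70, 0.33),
    "Managing Multiple Tasks": (0.82, 0.44, 0.09),
    "Reacting to Stress": (0.79, 0.14, 0.43),
    "Adaptability & Flexibility": (0.82, 0.30, 0.33),
    "Technical Knowledge": (0.25, 0.80, 0.13),
    "Teamwork": (0.27, 0.30, 0.87),
}

def communality(row):
    """Variance in one dimension accounted for by the retained components."""
    return sum(x * x for x in row)

for name, row in loadings.items():
    print(f"{name:<48}{communality(row):.2f}")
```

A dimension with a communality well below the others would be one that the three-component solution represents poorly.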


Table 4.16. Descriptive Statistics of High Fidelity Performance Measure Criterion Variables

                                                               N      Mean      SD
OTS Dimensions:
 1. Maintaining Separation                                    107     3.98     1.05
 2. Maintaining Efficient Air Traffic Flow                    107     4.22      .99
 3. Maintaining Attention and Situation Awareness             107     3.66     1.02
 4. Communicating Clearly, Accurately, and Efficiently        107     4.61      .96
 5. Coordinating                                              107     4.17      .97
 6. Performing Multiple Tasks                                 107     4.40     1.00
 7. Managing Sector Workload                                  107     4.39     1.03
Behavioral & Event Checklist:
 8. Operational Errors                                        107      .05      .04
 9. Operational Deviations                                    107      .11      .07
10. Failed to Accept Handoff                                  107      .31      .46
11. LOA/Directive Violations                                  107     2.42     1.26
12. Readback/Hearback Errors                                  107      .46      .44
13. Failed to Accommodate Pilot Request                       107      .45      .33
14. Made Late Frequency Change                                107      .44      .43
15. Unnecessary Delays                                        107     2.68     1.56
16. Incorrect Information in Computer                         107     1.04      .66

Table 4.17. Interrater Reliabilities^a for OTS Ratings (N=24)

Dimension                                              Median     Range
1. Maintaining Separation                               .95       .83 to .98
2. Maintaining Efficient Air Traffic Flow               .89       .71 to .94
3. Maintaining Attention and Situation Awareness        .83       .79 to .87
4. Communicating                                        .91       .88 to .93
5. Coordinating                                         .91       .86 to .96
6. Managing Multiple Tasks                              .88       .82 to .93
7. Managing Sector Workload                             .91       .85 to .95
8. Overall                                              .95       .88 to .97

^a Reliabilities are 2-rater intraclass correlation coefficients; these coefficients reflect the reliability of the mean ratings.


Table 4.18. Principal Components Analysis of the High Fidelity Criterion Space

Criterion Measures                                   Component 1     Component 2
Core Technical Proficiency
  OTS: Maintaining Separation                            .95             .05
  OTS: Coordinating                                      .87            -.12
  BEC: Operational Errors                               -.85            -.36
  OTS: Maintaining Attention/Awareness                   .83            -.20
  OTS: Performing Multiple Tasks                         .81            -.27
  OTS: Managing Sector Workload                          .80            -.29
  OTS: Communicating                                     .79            -.27
  OTS: Maintaining Efficient Air Traffic Flow            .78            -.30
  BEC: LOA/Directive Violations                         -.76            -.07
  BEC: Operational Deviations                           -.59             .05
Poor Sector Management
  BEC: Incorrect Information in Computer                 .10             .72
  BEC: Readback/Hearback Errors                         -.01             .63
  BEC: Make Late Frequency Changes                      -.13             .60
  BEC: Fail to Accommodate Pilot Requests               -.27             .54
  BEC: Unnecessary Delays                               -.45             .53
  BEC: Fail to Accept Handoffs                          -.37             .45

Percent Variance Accounted For                            59               9


Table 4.20. Job Analysis-Item Linkage Task Results for CBPM and HFPM

                                                                 Number of
Sub-Activities from Job Analysis                                 CBPM Items    HFPM Scenario/Item Numbers
 1. Checking and evaluating separation or traffic                    30        all scenarios
    movement to ensure separation is maintained
 2. Performing aircraft conflict resolution                          18        all scenarios
 3. Establishing and maintaining positive aircraft or                25        all scenarios
    vehicle identification
 4. Establishing arrival sequences                                    5        all scenarios
 5. Issuing clearances                                               24        all scenarios
 6. Responding to special conditions, unusual airspace or            15        scenarios 1, 2, 3, 5, 7
    aircraft operation
 7. Prioritizing sector/position tasks                               32        all scenarios
 8. Responding to pointouts based on current or                       1        scenarios 2, 4, 6, 7
    anticipated traffic situations
 9. Initiating pointouts                                              3        scenarios 1, 2, 3, 7
10. Assuming position responsibility                                  0        all scenarios
11. Scanning to maintain awareness of surrounding airspace           13        all scenarios
12. Managing personal workload                                       30        all scenarios
13. Briefing relieving controllers                                    8        scenarios 1, 2, 3, 5, 6, 7
14. Establishing/maintaining/terminating radio                       27        all scenarios
    communications
15. Recognizing and responding to deviations from ATC                 3        scenarios 4, 5, 6
    instructions/clearances
16. Performing procedures for non-radar environment                   4        scenarios 1, 2, 3, 4, 6, 7
17. Managing departure flows                                          1        scenarios 2 - 7
18. Responding to computer failures                                   0        scenario 5
19. Orienting lost aircraft                                           0        none
20. Establishing/re-establishing/terminating radar                   11        scenarios 1 - 7
    identification
21. Reviewing route of flight                                        33        scenarios 1 - 7
22. Issuing arrival and landing information or instructions          14        scenarios 1 - 7
23. Issuing departure information or instructions                     4        all scenarios
24. Responding to communications failures                             4        scenarios 2, 3, 4
25. Executing backup procedures for radar display failures            0        scenario 5
26. Managing departure traffic                                       11        all scenarios
27. Processing flight plans or flight plan amendments                25        scenarios 1 - 7
28. Executing backup procedures for facility                          2        scenarios 2, 4
    communications failures
29. Planning clearances                                              30        all scenarios
30. Initiating search and rescue procedures                           1        scenarios 2, 5
31. Updating flight progress strips                                  29        all scenarios
32. Conducting search and rescue procedures                           1        none
33. Initiating transfer of control or radar identification           10        scenarios 1 - 7
34. Receiving transfer of control or radar identification             8        scenarios 1 - 7
35. Performing minimum safe altitude processing                      10        all scenarios
36. Analyzing initial requests for clearances                        12        all scenarios
37. Responding to traffic management constraints or flow              3        scenarios 2, 4, 7
    control conflicts
38. Disseminating weather information to pilots/other                 8        all scenarios
    controllers
39. Responding to imposed airspace restrictions                       5        scenarios 1, 7
40. Responding to significant weather information                     4        scenarios 2 - 4

Note. Sub-activities are ranked according to their mean criticality for en route controllers.
Overall number of CBPM items tapping into sub-activities: Mean: 11.60; Standard Dev.: 10.94
Overall number of sub-activities per CBPM item: Mean: 12.21; Standard Dev.: 5.46
Overall number of sub-activities appearing in each HFPM scenario: Mean: 24.62; Standard Dev.: 2.38
Overall number of HFPM scenarios that a sub-activity appeared in (out of 7): Mean: 4.31; Standard Dev.: 2.67
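
The first summary line of the note to Table 4.20 can be reproduced from the "Number of CBPM Items" column. The sketch below transcribes that column in sub-activity order and computes its mean and sample standard deviation with Python's standard library. The use of the sample (n - 1) form is an assumption about how the reported value was obtained.

```python
from statistics import mean, stdev

# "Number of CBPM Items" column of Table 4.20, sub-activities 1 through 40.
cbpm_items_per_subactivity = [
    30, 18, 25, 5, 24, 15, 32, 1, 3, 0,
    13, 30, 8, 27, 3, 4, 1, 0, 0, 11,
    33, 14, 4, 4, 0, 11, 25, 2, 30, 1,
    29, 1, 10, 8, 10, 12, 3, 8, 5, 4,
]

avg = mean(cbpm_items_per_subactivity)
sd = stdev(cbpm_items_per_subactivity)   # sample standard deviation (n - 1)
print(f"Mean: {avg:.2f}; Standard Dev.: {sd:.2f}")
```

Both figures agree with the note, which also confirms that the column was transcribed without dropped or shifted entries.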


Table 5.2.1. 1990-1992 Profile Analysis of Actual FAA ATCS Applicants^1

Gender              Race/Ethnicity          Educational Level             Test Year
Male      80.8%     Native Am.     0.5%     < High school      17.4%      1990       32.7%
Female    19.1%     Asian          1.2%     Some college       51.0%      1991       51.9%
Missing    0.1%     Black          3.6%     Associate degree    0.6%      1992^2     14.0%
                    Hispanic       2.8%     Bachelor degree    25.9%      Missing     1.4%
                    White         40.6%     Graduate work       3.2%
                    Missing       51.3%     Advanced degree     1.6%
                                            Missing             0.4%

^1 Data provided by CAMI via OPM records.
^2 1992 data available for only 5,046 cases compared to 18,682 cases for 1991 and 11,791 cases for 1990.

Table 5.2.2. Bureau of Census Data for Race/Ethnicity

Race/Ethnicity              Percentage
White                         75.8%
Black                         11.8%
Hispanic                       8.8%
Asian/Pacific Islander         2.8%
Native American                0.8%
Other                          0.1%
TOTAL                        100.0%

Table 5.2.3. Background Characteristics by Testing Samples

Testing Sample              Gender         Race/Ethnicity
Air Traffic Controller      83% Male        2.0% Native American
(n=919)                     17% Female      0.6% Asian/Pacific Islander
                                            4.8% African American
                                            4.4% Hispanic
                                           87.7% Non-minority
                                            0.7% Other
                                            0.0% Mixed

Civilian PA                 66% Male        7.5% Native American
(n=262)                     34% Female      2.0% Asian/Pacific Islander
                                           10.2% African American
                                           11.0% Hispanic
                                           66.1% Non-minority
                                            2.8% Other
                                            0.4% Mixed

Military PA                 70% Male        2.7% Native American
(n=256)                     30% Female      4.3% Asian/Pacific Islander
                                           13.9% African American
                                            8.5% Hispanic
                                           67.0% Non-minority
                                            2.3% Other
                                            0.4% Mixed

NOTE: There were 166 missing cases for the gender analysis, of which 164 were ATCSs, and 170 missing cases for the
race/ethnicity analysis, of which 165 were ATCSs.

Table 5.4.1: Ethnicity and Gender of All Participants

                                    GENDER
ETHNICITY                       Male     Female     TOTAL
Native American / Alaskan         37        12         49
Asian / Pacific Islander          14         9         23
African-American                 120        39        159
Hispanic                          84        28        112
Caucasian                        990       244      1,234
Multi-Racial                       2         1          3
Other                             14         7         21
TOTAL                          1,261       340      1,601

Table 5.4.2. Educational Background of All Participants

Education Level                                Number of Participants
High School or GED                                      227
Attended Trade School(s)                                 14
Completed Trade School                                   41
Attended College, less than 2 years                     370
Attended College, 2 years or more                       376
Completed College, with a 2-year degree                 109
Completed College, with a 4-year degree                 394
Attended Graduate School                                 70
TOTAL                                                 1,601


Table 5.4.3: Data Collection Locations for All Participants

Facility          No. of Participants     Facility              No. of Participants
Albuquerque              166              Miami                        120
Boston                    87              Minneapolis                   82
Denver                   148              Atlanta                      100
Fort Worth               114              Chicago                       44
Houston                  142              Cleveland                     39
Jacksonville             104              New York                       6
Kansas City              109              Oakland                        5
Los Angeles               91              Washington, D.C.              22
Memphis                  111              Keesler AFB                  262


Table 5.4.5. Air Traffic Controller Sample Educational Background

Education Level                                Number of Participants
High School or GED                                       98
Attended Trade School(s)                                  7
Completed Trade School                                   24
Attended College, less than 2 years                     224
Attended College, 2 years or more                       271
Completed College, with a 2-year degree                  79
Completed College, with a 4-year degree                 324
Attended Graduate School                                 60
TOTAL                                                 1,087

Table 5.4.6: Air Traffic Controller Sample from Participating Locations

Facility          No. of Participants     Facility              No. of Participants
Albuquerque              109              Miami                         91
Boston                    75              Minneapolis                   76
Denver                   118              Atlanta                       77
Fort Worth                93              Chicago                       38
Houston                  116              Cleveland                     35
Jacksonville              99              New York                       6
Kansas City               84              Oakland                        5
Los Angeles               82              Washington, D.C.              22
Memphis                   92

Table 5.4.7. Air Traffic Controller Sample Time in Current Position

Position                    No. of Participants    Average Time    Minimum     Maximum
Journeyman Controller              964              8.86 years     1 month     31 years
Developmental Controller            11              2.61 years     3 months    6.67 years
Staff                               47              2.25 years     1 month     16.67 years
Supervisor                          43              2.25 years     1 month     20.50 years
Other                               25              4.71 years     2 months    23 years
TOTAL                            1,090              8.29 years     1 month     31 years

Table 5.4.8. Air Traffic Controller Sample Job Experience at any Facility

Position                      Years     Months
Developmental Controller       2.78      3.53
FPL Controller                 7.31      3.75
Staff                           .63       .92
Supervisor                      .35       .73

Table 5.4.9. Ethnicity and Gender of Pseudo-Applicant Sample

                                   Gender
Ethnicity                      Male     Female     Total
Military*
Native American/Alaskan           7         0          7
Asian/Pacific Islander            6         5         11
African-American                 28         8         36
Hispanic                         17         5         22
Caucasian                       120        56        176
Multi-Racial                      1         0          1
Other                             3         3          6
Total                           182        77        259

Civilian*
Native American/Alaskan          11         8         19
Asian/Pacific Islander            3         2          5
African-American                 12        13         25
Hispanic                         19        11         30
Caucasian                       119        49        168
Multi-Racial                      0         1          1
Other                             4         3          7
Total                           168        87        255

* Three individuals in the Military PA and two individuals in the Civilian PA did not indicate their race; two
individuals in the Civilian PA did not indicate gender. These individuals were not included in this matrix.

Table 5.4.11. CUE-Plus Means and Standard Deviations by Sample

Table 5.4.12. Inter-Correlations of CUE-Plus Items

Surf the ‘Net  .561*  .562*  .442*  .512*  .371*  .590*  .437*  .456*  .605*  .567*

Table 5.4.13: Item-Total Statistics for CUE-Plus: All Respondents

                                                                                       Corrected
                                                   Scale Mean if    Scale Variance     Item-Total     Alpha if
Item                                               Item Deleted     if Item Deleted    Correlation    Item Deleted
I read computer magazines                              49.17            230.66             .70            .94
I know how to recover data                             48.98            226.74             .75            .94
I know what a LAN is                                   48.98            228.07             .63            .94
I know what an operating system is                     48.14            225.82             .75            .94
I know how to write computer programs                  49.46            236.98             .57            .94
I know how to install software                         48.04            222.56             .78            .94
I know what e-mail is                                  47.00            247.27             .53            .94
I know what a database is                              47.51            236.32             .67            .94
I am computer literate                                 48.13            226.66             .86            .94
I regularly use a PC for word processing               48.19            223.75             .78            .94
I often use a mainframe computer                       49.11            241.70             .41            .95
I am good at using computers                           48.34            227.98             .85            .94
I frequently play action games                         48.90            236.04             .53            .94
I regularly use a PC for spreadsheets                  49.12            231.89             .68            .94
I frequently use e-mail to exchange messages           48.43            222.96             .75            .94
  or information
I am proficient at using a mouse with my               47.52            231.38             .71            .94
  computer
I frequently “surf” the Internet                       48.52            224.30             .72            .94
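
The "Alpha if Item Deleted" column in Table 5.4.13 is the kind of statistic produced by recomputing coefficient alpha with each item left out in turn. The sketch below shows that calculation using the standard formula for Cronbach's alpha. The response data are hypothetical (they are not CUE-Plus data), and the function names are introduced here for illustration.

```python
from statistics import variance

def cronbach_alpha(items):
    """Coefficient alpha for a list of item-score columns (one list per item)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]   # each respondent's scale total
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical 5-point responses: 4 items x 6 respondents (not CUE-Plus data).
demo_items = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 5, 3],
    [1, 3, 3, 4, 4, 2],
    [5, 1, 4, 2, 3, 3],   # an item that tracks the others poorly
]
print(round(cronbach_alpha(demo_items), 2))
print([round(a, 2) for a in alpha_if_item_deleted(demo_items)])
```

In the hypothetical data, removing the poorly tracking fourth item raises alpha, which is the pattern the table shows in a much milder form for "I often use a mainframe computer" (.95 versus .94 for the other items).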
Table 5.4.14: Varimax and Oblique Rotated Factor Patterns (CUE-Plus)


Varimax  Rotation 

Oblique  Rotation 

ITEM 

Factor  1 

Factor  2 

Factor  1 

Factor 2

Single  Factor 
Specified 

Computer  magazines 

.749 

.253 

.837 

.730 

Recover  data 

.744 

.348 

.820 

.003 

.789 

What  a  LAN  is 

.633 

.301 

.697 

.008 

.675 

What  an  operating  system  is 

.593 

.528 

.623 

.267 

.794 

How  to  write  computer 
programs 

.742 

.008 

.850 

.281 

.610 

How  to  install  software 

.564 

.582 

.819 

What  e-mail  is 

.003 

.851 

.884 

.584 

What  a  database  is 

.739 

.277 

.626 

.724 

Computer  literate 

.648 

.611 

.677 

.328 

.890 

Use  a  PC  for  word  processing 

.570 

.583 

.590 

.336 

.813 

Use  a  mainframe  computer 

.441 

.187 

.489 

.002 

.455 

Good  at  using  computers 

.696 

.533 

.742 

.222 

.875 

Frequently  play  action  games 

.526 

.266 

.577 

.002 

.571 

Use  a  PC  for  spreadsheets 

.668 

.319 

.735 

.010 

.713 

Use  e-mail  to  exchange 
messages  or  information 

.515 

.595 

.526 

.376 

.780 

Proficient  at  using  a  mouse 
with  my  computer 

.385 

.713 

.361 

.564 

.759 

I frequently “surf” the Internet

.512 

.563 

.525 

.344 

.756 

Table 5.4.15: Eigenvalues and Variance (CUE-Plus)

              Eigenvalue         % of Variance      Cumulative %
              (from Varimax)     (from Varimax)     (from Varimax)
Factor 1          9.17               53.9               53.9
Factor 2          1.07                6.3               60.2
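
The percentages in Table 5.4.15 follow from dividing each eigenvalue by the number of items analyzed. The sketch below assumes the analysis used the 17 CUE-Plus items listed in Table 5.4.13 and that each standardized item contributes one unit of variance; both points are assumptions made for illustration, as are the variable names.

```python
from itertools import accumulate

N_ITEMS = 17                 # CUE-Plus items listed in Table 5.4.13 (assumed basis)
eigenvalues = [9.17, 1.07]   # Factor 1 and Factor 2 from Table 5.4.15

# With standardized items, total variance equals the number of items.
pct_variance = [100.0 * ev / N_ITEMS for ev in eigenvalues]
cumulative = list(accumulate(pct_variance))

for i, (pct, cum) in enumerate(zip(pct_variance, cumulative), start=1):
    print(f"Factor {i}: {pct:.1f}% of variance, {cum:.1f}% cumulative")
```

The computed values round to the table's 53.9, 6.3, and 60.2.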

Table 5.4.16. CUE-Plus Means, S.D. and d-Score for Gender

                            Males                       Females
                      n      Mean     S.D.        n      Mean     S.D.       d
All Respondents     1226    52.26    16.26       340    47.85    15.02      .21
Controllers          879    51.74    16.27       176    43.50    16.25      .42
Military PA          179    49.20    16.52        77    47.27    16.12
Civilian PA          168    58.23    14.48        87    57.17    11.94

Table 5.4.17. Means, S.D. and d-Score for Ethnicity

                         Comparison Group                  Caucasian
                      n      Mean     S.D.        n      Mean     S.D.       d
African American / Caucasian
All Respondents      159    49.88    14.95      1197    51.66    16.14
Controllers           98    49.16    14.92       858    50.71    16.30
Military PA           36    44.03    13.35       172    50.12    16.39      .31
Civilian PA           25    61.12    11.30       167    58.11    13.43     -.20

Hispanic / Caucasian
All Respondents      113    50.50    16.05      1197    51.66    16.14
Controllers           61    49.20    17.14       858    50.71    16.30
Military PA           22    46.41    13.99       172    50.12    16.39
Civilian PA           30    56.17    14.00       167    58.11    13.43      .12

Minority / Caucasian
All Respondents      365    50.04    15.68      1197    51.66    16.14      .07
Controllers          196    48.84    15.71       858    50.71    16.30
Military PA           82    45.39    14.95       172    50.12    16.39      .18
Civilian PA           87    57.14    13.97       167    58.11    13.43
Table 5.4.21. Correlations between CUE-Plus and Predictor Battery: Pseudo Applicants


15 


CUE-Plus  Total  .249**  .313**  .134**  .184**  .203**  .158*  .099*  .173**  .200**  .265**  .083 

*p < .05  **p < .01
Table 5.4.22. Determinants of Applied Math Test: No. of Items Correct


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

t 

b 

Beta 

t 

b 

Beta 

t 

Intercept 

7.12 

4.13 

3.16 

3.11 

Age 

.26 

.19 

4.28 

.24 

.17 

.30 

.21 

4.13 

Gender 

-3.47 

-.27 

-6.63 

-3.71 

-.28 

-3.55 

-.27 

-5.70 

Race 

2.24 

.18 

4.32 

3.50 

.20 

4.48 

1.66 

.09 

1.98 

Education 

.31 

.11 

2.32 

.41 

.14 

.37 

.12 

2.33 

CUE-Plus 

.06 

.15 

3.50 

.05 

.13 

2.81 

.05 

.13 

2.79 

Adj.  R 

Square 

.20 

.21 

.19 

Table  5.4.23.  Determinants  of  Angles  Test:  No.  of  Items  Correct 


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

T 

b 

Beta 

t 

b 

Beta 

t 

Intercept 

22.04 

20.32 

20.33 

15.53 

21.16 

17.15 

Age 

n.s. 

n.s. 

n.s. 

Gender 

-2.31 

-.21 

-4.80 

-2.41 

-.21 

-4.40 

-1.96 

-.18 

-3.60 

Race 

1.51 

.14 

3.16 

2.60 

.18  j 

3.70 

n.s. 

Education 

.35 

.14 

3.08 

.40 

.16  | 

.50 

.20 

CUE-Plus 

.09 

2.02 

.04 

.12  ; 

2.32 

.05 

.15 

Adj.  R 

Square 

.09 

.12 

.10 

Table  5.4.24.  Determinants  of  Air  Traffic  Scenarios:  Efficiency 


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

t 

b 

Beta 

t 

b 

Beta 

T 

Intercept 

41.30 

15.80 

37.36 

11.52 

46.39 

15.38 

Age 

n.s. 

n.s. 

n.s. 

Gender 

-7.01 

-.25 

-5.89 

-.24

-5.16 

-7.35 

-.26 

-5.30 

Race 

6.78 

HI 

wsmswm 

.30 

n.s. 

Education 

n.s. 

n.s. 

n.s. 

CUE-Plus 

.21 

.25 

5.97 

.20 

.23 

4.90 

.25 

.29 

5.94 

Adj .  R  Square 

.19  | 

.20 

.15 
Table 5.4.25. Determinants of Air Traffic Scenarios: Safety


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

t 

b 

Beta 

t 

b 

Beta 

T 

Intercept 

39.84 

14.23 

41.19 

9.72 

41.29 

13.40 

Age 

n.s. 

n.s. 

n.s. 

Gender 

n.s. 

-3.58 

-.10 

-1.99 

n.s. 

Race 

3.11 

.09 

1.98 

6.83 

.15 

2.94 

n.s. 

Education 

n.s. 

n.s. 

n.s. 

CUE-Plus 

.16 

.15 

3.32 

.15 

.14 

2.80 

.19 

.18 

3.53 

Adj.  R  Square 

.03 

.05 

I  .03 

Table  5.4.26.  Determinants  of  Air  Traffic  Scenarios:  Procedural  Accuracy 


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

B 

Beta 

T 

b 

Beta 

t 

b 

Beta 

t 

Intercept 

20.65 

3.57 

19.01 

2.80 

20.32 

3.14 

Age 

.61 

.13 

2.97 

.53 

.12 

2.30 

.77 

.17 

3.30 

Gender 

n.s. 

n.s. 

n.s. 

Race 

5.655 

.13 

2.90 

8.46 

.15 

2.91 

n.s. 

Education 

n.s. 

n.s. 

n.s. 

CUE-Plus 

.24 

.18 

4.07 

.26 

.19 

3.79 

.28 

.22 

4.26 

Adj .  R  Square 

.07 

.07 

.08 

Table  5.4.27.  Determinants  of  Analogy:  Information  Processing 


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

b 

Beta 

ra 

b 

Beta 

t a 

114.66 

114.87 

51.88 

116.10 

52.15 

-.43 

-.23 

-4.81 

-.42 

-.23 

-4.16 

-.50 

-.27

-4.91 

Gender 

n.s. 

n.s. 

n.s. 

Race 

n.s. 

n.s. 

n.s. 

Education 

-.64 

-.16 

-3.32 

-.69 

-.17 

-3.12 

-.52 

-.13 

-2.41 

CUE-Plus 

n.s. 

n.s. 

n.s. 

Adj.  R  Square 

.10 

.11 

.11 
Table 5.4.28. Determinants of Analogy Test: Reasoning


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and 
African-Americans 

Race: 

Caucasians  and 
Hispanics 

Variable 

b 

Beta 

t 

b 

Beta 

t 

h 

t 

Intercept 

11.94 

10.84 

6.99  ! 

2.54 

Age 

n.s. 

n.s. 

.26 

.15 

2.86 

Gender 

n.s. 

n.s. 

n.s. 

Race 

2.95 

4.57  | 

4.27 

.20 

4.27 

2.35 

.11 

2.27 

Education 

.91 

.26 

.94 

.26 

5.28 

.82 

.23 

4.21 

CUE-Plus 

.08 

.17 

3.82  | 

.07 

.15 

2.97 

.09 

.18 

3.71 

Adj.  R  Square 

.16  | 

.15 

.16 

Table  5.4.29.  Determinants  of  Dials  Test:  No.  of  Items  Correct 


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and 
African-Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

T 

b 

Beta 

t 

B 

Beta 

t 

Intercept 

15.96 

30.59 

14.98 

24.18 

16.64 

32.83 

Age 

n.s. 

n.s. 

n.s. 

Gender 

-1.0 

WBBH 

— KB 

-.17 

-3.42 

IHE9 

-.21 

-4.15 

Race 

.84 

HQ 

■Em 

.22 

4.46 

.12 

2.37 

Education 

.11 

.09 

.11 

2.18 

.18 

.16 

3.11 

CUE-Plus 

.02 

.10 

2.15 

.02 

.11 

2.16 

n.s. 

Adj.  R  Square 

.08 

.10 

.07 

Table  5.4.30.  Determinants  of  Letter  Factory  Test:  Situational  Awareness 


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

T 

t 

b 

Beta 

t 

b 

Beta 

t 

Intercept 

20.77 

14.17 

21.74 

10.82 

Age 

n.s. 

n.s. 

n.s. 

Gender 

n.s. 

n.s. 

n.s. 

Race 

4.04 

.22 

6.68 

.26 

5.49 

2.86 

.11 

2.19 

Education 

.57 

.14 

.53 

.12 

2.42 

.56 

.13 

2.56 

CUE-Plus 

.111 

.19 

.09 

.16 

3.42 

.12 

.20 

3.88 

Adj.  R  Square 

.12 

.12 

.07 
Table 5.4.31. Determinants of Letter Factory Test: Planning & Thinking Ahead


Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

Variable 

b 

Beta 

t 

b 

Beta 

t 

b 

Beta 

t 

Intercept 

.09 

7.79 

.07 

4.44 

.12 

8.61 

Age 

n.s. 

n.s. 

n.s. 

Gender 

n.s. 

n.s. 

n.s. 

Race 

.002 

.19 

4.42 

.06 

.28 

5.85 

n.s. 

Education 

n.s 

n.s. 

n.s. 

CUE-Plus 

.01 

.30 

6.99 

.001 

.27 

5.82 

.002 

.33 

6.71 

Adj .  R  Square 

.13 

.16 

.11 

Table  5.4.32.  Determinants  of  Scan  Test:  Total  Score 


BIB 

Race:  Caucasians  and 
Minorities 

Race:  Caucasians  and  African- 
Americans 

Race:  Caucasians  and 
Hispanics 

b 

Beta 

T 

b 

Beta 

t 

b 

Beta 

t 

157.63 

49.89 

147.07 

26.20 

145.43 

22.75 

Age 

n.s. 

n.s. 

n.s. 

Gender 

n.s. 

n.s. 

n.s. 

Race 

n.s. 

11.28 

.12 

2.30 

n.s. 

Education 

1.90 

.70 

2.73 

1.89 

.12 

2.29 

2.05 

.13 

2.53 

CUE-Plus 

n.s. 

n.s. 

.23 

.11 

2.12 

.01 

.02 

.03 
Table 5.5.1. Simple Validities: Correlations Between Predictor Scores and Criteria

Criterion 


Corrected  Uncorrected 

Correlation  Correlation 


Test 

Scale 

CBPM 

Rat¬ 

ings 

Comp¬ 

osite 

CBPM 

Rat¬ 

ings 

Comp¬ 

osite 

Predictor  Composite 

Scaled  Score 

70 

32 

68 

52 

21 

51 

AM:  Applied  Math 

Number  Correct 

59 

28 

58 

41 

18 

41 

AN:  Angles 

Number  Correct 

57 

19 

55 

35 

10 

33 

AT:  Air  Traffic 

Scenarios 

32 

15 

32 

Efficiency 

32 

16 

32 

30 

15 

31 

Procedural  Accuracy 

19 

09 

18 

14 

06 

13 

Safety 

26 

11 

25 

24 

10 

23 

AY:  Analogies 

43 

09 

38 

Info.  Proc.  Latency 

02 

00 

01 

02 

00 

01 

Info.  Proc.  Windows 

22 

06 

20 

21 

06 

19 

Reasoning 

42 

10 

38 

40 

09 

36 

DI:  Dials 

Number  Correct 

35 

09 

32 

27 

07 

24 

EQ:  Experiences 

Questionnaire 

All  scales 

16 

17 

18 

Final  scales 

09 

16 

14 

Dropped  scales 

05 

06 

00 

Composure 

11 

15 

15 

09 

13 

13 

Concentration 

09 

09 

11 

07 

07 

09 

Behavioral  Consistency 

07 

17 

14 

06 

16 

12 

Cooperation 

-07 

08 

-02 

-07 

08 

-02 

Decisiveness 

05 

11 

09 

04 

09 

07 

Execution 

04 

09 

07 

03 

08 

06 

Flexibility 

03 

07 

05 

03 

06 

05 

Tolerance  for  High  Intensity 

-02 

02 

-01 

-02 

02 

-01 

Self  Awareness 

05 

05 

07 

05 

05 

06 

Self  Confidence 

01 

11 

06 

01 

09 

05 

Sustained  Attention 

07 

06 

08 

06 

05 

07 

Taking  Charge 

-02 

03 

00 

-02 

03 

00 

Interpersonal  Tolerance 

00 

09 

05 

01 

10 

05 

Task  Closure 

01 

09 

05 

01 

07 

04 

LA:  Letter  Factory 

36 

11 

33 

Situational  Awareness 

38 

12 

35 

33 

10 

30 

Planning  &  Thinking  Ahead 

35 

12 

33 

32 

11 

30 

ME:  Memory 

Number  Correct 

24 

05 

21 

22 

05 

19 

MR:  Memory  Retest 

Number  Correct 

27 

09 

25 

25 

08 

23 

Criterion 

Test 

Scale 

Corrected 

Correlation 

Uncorrected 

Correlation 

Rat- 

CBPM  ings 

Comp¬ 

osite 

Rat- 

CBPM  ings 

Comp¬ 

osite 

PL:  Planes 

27 

08 

25 

Projection 

31 

08 

28 

20 

05 

18 

Visual/Spatial 

29 

07 

26 

15 

04 

14 

Timesharing 

21 

09 

20 

19 

08 

18 

SC:  Scan 

Number  Correct 

26 

08 

24 

21 

06 

19 

SN:  Sound 

Digits  Correct 

14 

05 

13 

16 

05 

14 

TW:  Time  Wall 

30 

08 

27 

Perceptual  Accuracy 

48 

15 

45 

23 

07 

21 

Perceptual  Speed 

12 

02 

10 

10 

02 

08 

Time  Estimation  Accuracy 

26 

09 

25 

22 

07 

21 

Notes.  Decimals  omitted.  N  =  984-1056.  Uncorrected  correlations  above  .04  are  significant  at  p<  .05.  Uncorrected 
correlations  above  .05  are  significant  at  p  <  .01.  Corrected  Correlations  are  corrected  for  range  restriction  in  the 
predictor;  they  are  estimates  of  what  the  correlations  would  be  in  an  applicant  sample.  The  scores  in  the  final  battery 
are  boldfaced.  The  multiple  correlations  are  corrected  for  shrinkage  (to  correct  for  overfitting  the  regression 
equation  to  the  sample). 
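
The notes to Table 5.5.1 say the corrected correlations adjust for range restriction in the predictor, but this section does not state which formula was applied. As an illustration only, the sketch below uses the common univariate correction for direct range restriction (often called Thorndike's Case II). The function name and the standard-deviation ratio `u` are assumptions introduced here; the report's actual procedure may differ (for example, a multivariate correction).

```python
import math

def correct_for_range_restriction(r, u):
    """Univariate direct range-restriction correction (Thorndike Case II).

    r: correlation observed in the restricted (incumbent) sample
    u: SD of the predictor among applicants divided by its SD among incumbents
    """
    return r * u / math.sqrt(1 - r * r + r * r * u * u)

# Hypothetical values: an observed r of .30 and an applicant SD 1.5 times the incumbent SD.
print(round(correct_for_range_restriction(0.30, 1.5), 2))
```

When `u` equals 1 (no restriction) the correlation is returned unchanged, and larger values of `u` move the estimate further above the observed value, which is the direction of the differences between the corrected and uncorrected columns in the table.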
Table 5.5.2. Incremental Validities: Increases in Validities when Adding a Scale or Test


_ Criterion _ _ 

All  Other  Final  Battery 

All  Other  Tests  Entered  Tests  Entered 

Rat-  Comp-  Rat-  Comp- 

Test _ Scale _ CBPM  ings  osite _ CBPM  ings  osite 


AM:  Applied  Math 

Number  Correct 

AN:  Angles 

Number  Correct 

AT:  Air  Traffic 
Scenarios 

Efficiency 

Procedural  Accuracy 

Safety 

AY:  Analogies 

Info.  Proc.  Latency 

Info.  Proc.  Windows 
Reasoning 

DI:  Dials 

Number  Correct 

EQ:  Experiences 
Questionnaire 

All  scales 

Final  scales 

Dropped  scales 

Composure 

Concentration 

Behavioral  Consistency 
Cooperation 

Decisiveness 

Execution 

Flexibility 

Tolerance  for  High  Intensity 
Self  Awareness 

Self  Confidence 

Sustained  Attention 

Taking  Charge 

Interpersonal  Tolerance 
Task  Closure 

LA:  Letter  Factory 

Situational  Awareness 
Planning  &  Thinking  Ahead 

Memory 

ME:  Memory 

MR:  Memory  Retest 

PL:  Planes 

Projection 

Visual/Spatial 

Timesharing 

SC:  Scan 

Number  Correct 

122 

125 

155 

126 

133 

163 

083 

005 

060 

084 

014 

057 

101 

088 

116 

109 

082 

118 

027 

068 

055 

028 

064 

054 

066 

034 

067 

on 

034 

076 

035 

014 

019 

032 

018 

015 

103 

035 

063 

141 

030 

101 

000 

OB 

007 

024 

024 

030 

046 

001 

034 

064 

001 

049 

093 

034 

053 

118 

022 

079 

048 

020 

027 

052 

019 

029 

102 

196 

136 

084 

166 

135 

073 

190 

124 

066 

071 

067 

061 

069 

061 

013 

060 

040 

019 

055 

042 

052 

019 

049 

035 

010 

032 

026 

088 

064 

019 
077

047 

037 

017 

061 

021 

035 

001 

008 

003 

007 

001 

005 

031 

027 

037 

018 

008 

018 

019 

054 

042 

030 

058 

052 

022 

039 

036 

019 

047 

039 

019 

019 

005 

022 

016 

009 

010 

051 

033 

004 

032 

013 

023 

059 

048 

021 

059 

046 

024 

047 

042 

034 

040 

046 

007 

018 

004 

031 

036 

006 

026 

002 

021 

029 

010 

027 

052 

024 

051 

054 

020 

051 

025 

002 

021 

030 

011 

028 

035 

022 

038 

031 

013 

030 

052 

041 

057 

054 

039 

058 

017 

036 

031 

031 

001 

024 

042 

041 

053 

050 

028 

052 

051 

032 

042 

051 

028 

040 

049 

007 

034 

051 

059 

093 

017 

000 

013 

010 

004 

006 

015 

030 

027 

016 

029 

027 

058 

022 

055 

076 

027 

071 

                               All Other Tests Entered        All Other Final Battery Tests Entered
Test / Scale                   CBPM   Ratings   Composite     CBPM   Ratings   Composite
SN: Sound
  Digits Correct               014    000       011           015    007       008
TW: Time Wall                  032    017       028           034    017       029
  Perceptual Accuracy          021    007       013           036    008       023
  Perceptual Speed             016    018       021           031    012       029
  Time Estimation Accuracy     010    000       007           009    004       009

Notes.  Decimals  omitted.  N  =  920  or  944.  Values  are  italicized  for  situations  in  which  the  score  had  a  negative 
regression  coefficient.  All  correlations  are  uncorrected.  The  scores  in  the  final  battery  are  boldfaced.  For  p  <  .05 
level  of  significance,  the  incremental  validity  for  a  single  scale  must  be  greater  than  about  .06  (the  critical  value 
varies  from  .055  to  .062,  depending  upon  the  column).  For  the  first  three  columns,  the  incremental  validity  indicates 
how  much  the  multiple  R  decreases  when  that  scale  is  removed  from  the  complete  set  of  scales.  For  the  last  three 
columns, the incremental validity indicates how much the multiple R increases when that single score is added to
the  final  AT-SAT  battery.  The  multiple  correlations  are  corrected  for  shrinkage  (to  correct  for  overfitting  the 
regression  equation  to  the  sample). 
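
To make the incremental-validity definition in the notes above concrete, the sketch below computes how much the multiple R rises when a second score is added to a one-score battery, after a shrinkage adjustment. It is only an illustration under stated assumptions: the three correlations are hypothetical, the two-predictor formula stands in for the full regression on all scales, and the Wherry-type adjustment is assumed because the notes do not say which shrinkage formula was used. The sample size of 944 is borrowed from the notes.

```python
import math

def multiple_r_two_predictors(r1y, r2y, r12):
    """Multiple R for a criterion regressed on two predictors, from correlations."""
    r_squared = (r1y**2 + r2y**2 - 2 * r1y * r2y * r12) / (1 - r12**2)
    return math.sqrt(r_squared)

def shrunken_r(r, n, k):
    """Wherry-type shrinkage adjustment (an assumed formula, not taken from the report)."""
    adjusted = 1 - (1 - r**2) * (n - 1) / (n - k - 1)
    return math.sqrt(max(adjusted, 0.0))

# Hypothetical correlations: predictor 1 with criterion, predictor 2 with
# criterion, and the two predictors with each other.
r1y, r2y, r12 = 0.40, 0.30, 0.50
n = 944  # one of the sample sizes given in the notes

r_one = shrunken_r(abs(r1y), n, 1)
r_two = shrunken_r(multiple_r_two_predictors(r1y, r2y, r12), n, 2)
incremental_validity = r_two - r_one
print(f"{incremental_validity:.3f}")
```

The same subtraction, run in the other direction (full set minus the set with one scale removed), corresponds to the first three columns of the table.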


Table 5.5.3. Comparison of Five Predictor Weighting Methods

                                                   Predictor Weighting Method
                                                                                        Optimal
Statistic                                          Regression        Unit               Validity           low d-score        Combined
Validity                                           .521              .463               .501               .435               .506
Validity corrected for range restriction           .691              .604               .664               .631               .682
Validity corrected for range restriction
  and shrinkage                                    .666¹             .604               .644²              .603               .663³
d-score for blacks vs. whites                      -.85              -.92               -.88               -.55               -.81
Largest t for difference in standardized
  regression slopes between racial/gender
  groups                                           -0.44 (females)   -1.65 (females)    -0.59 (females)    0.02 (females)     -0.58 (females)

¹ This validity's one-tailed lower confidence limit = .607.

² The correction for shrinkage using the validity weighting method is likely overcorrecting to a moderate extent.
Thus, the best estimate of this value is likely greater than .644 and less than .664.

³ The correction for shrinkage using the validity weighting method is likely overcorrecting to some small extent.
Thus, the best estimate of this value is likely greater than .663 and less than .682.

Notes. The regression and unit-weighting methods used all 35 scales. The other weighting methods included
only the 26 scales from the tests retained for AT-SAT Version 1.01. Negative values of t indicate that the female
slope was lower than the male slope.
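
As a rough check on the "corrected for range restriction" row, the sketch below applies the standard univariate (Thorndike Case II) correction. This is an assumption: the report does not state here which correction formula it used. The two standard deviations are the composite-predictor values listed in Table 5.6.2 (7.91 for controllers, 12.59 for pseudo-applicants), and the function name is hypothetical.

```python
import math

def correct_for_range_restriction(r, sd_restricted, sd_unrestricted):
    """Thorndike Case II correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

# Combined weighting method: uncorrected validity from Table 5.5.3,
# composite-predictor SDs from Table 5.6.2.
corrected = correct_for_range_restriction(0.506, 7.91, 12.59)
print(round(corrected, 3))
```

Under these assumptions the result lands close to the .682 shown for the combined method. The other columns are based on differently weighted composites, so this pair of standard deviations should not be expected to reproduce them.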

Table 5.5.4. Validity Coefficients for the Predictor Composite

                                                Criterion
                                                                                    High Fidelity -   High Fidelity -
                                                                                    Core              Controlling Traffic
                                                Composite    CBPM         Ratings   Technical         Safely & Efficiently
N                                               1029         1032         1053      106               106
r uncorrected                                   .51          .52          .21       .22               .18
r corrected for range restriction               .68          .70          .32       .33               .28
r corrected for range restriction
  and shrinkage                                 .66          .68          .22       n/a¹              n/a¹
Criterion reliability                           see Notes    see Notes    .71       .95               .99
                                                below        below
r corrected for range restriction
  and criterion unreliability
    Using best estimate of reliability          .76          .78          .38       .34¹              .28¹
    Using upper bound estimate of reliability   .70          .74
    Using lower bound estimate of reliability   .79          .84

¹ The Hi Fidelity scores were not used to determine the weights used in the predictor composite so it was not
appropriate to correct for shrinkage due to capitalization on chance in the estimation of the predictor weights.

Notes.  Interrater  agreement  reliability  was  used  to  correct  the  validities  for  the  Ratings  and  HiFi  criteria.  Reliability 
for  the  CBPM  was  estimated  by  computing  its  internal  consistency  (coefficient  alpha  =  .59),  but  this  figure  is 
probably  an  underestimate  because  the  CBPM  appears  to  be  multidimensional  (according  to  factor  analyses).  Thus, 
three  different  reliabilities  were  used  to  correct  the  CBPM’s  validity  for  unreliability:  .8  (best  guess),  .9  (upper 
bound  estimate),  and  .7  (lower  bound  estimate),  respectively.  The  composite  criterion  reliability  was  estimated  as 
the  mean  of  the  ratings  and  CBPM  reliabilities. 
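
The three CBPM entries in the "criterion unreliability" rows can be traced with the classical correction for attenuation, dividing the validity by the square root of the criterion reliability. The sketch below assumes that formula (the notes do not write it out) and starts from the CBPM validity corrected for range restriction, .70. The function name is hypothetical.

```python
import math

def correct_for_criterion_unreliability(r, criterion_reliability):
    """Classical correction for attenuation due to an unreliable criterion."""
    return r / math.sqrt(criterion_reliability)

r_cbpm = 0.70  # CBPM validity corrected for range restriction (Table 5.5.4)
reliabilities = {"best guess": 0.8, "upper bound": 0.9, "lower bound": 0.7}

corrected = {
    label: round(correct_for_criterion_unreliability(r_cbpm, rel), 2)
    for label, rel in reliabilities.items()
}
print(corrected)
```

With these inputs the values come out as .78, .74, and .84, which agree with the CBPM column. A higher assumed reliability gives a smaller correction, which is why the "upper bound" reliability yields the lowest corrected validity.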

Table 5.5.5. Effect of Cut Score on Predicted Controller Performance

                                                                          Applicant Screening Model
                                                                          Pass All       Screen at
                                                                   N      Applicants     Cut Score
Cut Score on Raw Predictor (Scaled Cut Score = 70)                        none           0.51

Passing Pseudo-Applicant Demographics
(Number in Group Passing / Total Number Passing)
  All                                                              511    100%           100%
  Male                                                             348    68%
  Female                                                           162    32%
  White                                                            339    75%            92%
  Black                                                            60     13%            2%
  Hispanic                                                         51     11%            6%
  Other/Missing Race                                               61

Passing Rates of Pseudo-Applicants (Proportion of Each Group Passing)
  All                                                                                    .22
  Male                                                                                   .26
  Female                                                                                 .14
  White                                                                                  .28
  Black                                                                                  .03
  Hispanic                                                                               .12

Relative Passing Rates of Pseudo-Applicants
  Female (Relative to Males) = .14 / .26                                                 .54
  Black (Relative to Whites) = .03 / .28                                                 .11
  Hispanic (Relative to Whites) = .12 / .28                                              .43

Predicted Criterion Score (as Controller z-score)
  At the Cut Score                                                                       -0.22
  Mean For Pseudo-Applicants Passing                                      -0.83          0.19

Predicted Criterion Score Expressed as the Percentile Rank on the
Current Controller Distribution
  At the Cut Score                                                                       41%
  Mean For Pseudo-Applicants Passing                                      33%            56%

Proportion of Pseudo-Applicants
  Passing                                                                                .22
  Passing Above Current Controllers' Mean Criterion                       .23            .59

Descriptive Statistics for Predicted Criterion Scores among Pseudo-Applicants Passing
  Mean                                                                    -0.86          0.24
  Standard Deviation (adjusted for estimated error of prediction)         1.17           0.88

Descriptive Statistics for Criterion Scores among Controllers Passing
  Mean                                                                    0.00           0.29
  Standard Deviation                                                      1.00           0.89

Notes. The criterion scores are z-scores on the current Controller distribution. The predictor scores are the
weighted sum of the z-scores based on the Pseudo-Applicants' distribution. Passing rates shown are for the
Pseudo-Applicant sample. Actual passing rates will likely differ somewhat because (a) the small sample size
of some groups limits the accuracy of the estimated passing rates and (b) the degree of correspondence
between the Pseudo-Applicants and future applicants is unknown.
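
The relative passing rates in Table 5.5.5 are simple ratios of each group's passing rate to that of its reference group. A minimal sketch of that arithmetic, using the passing rates printed in the table (the function and variable names are hypothetical):

```python
def relative_passing_rate(group_rate, reference_rate):
    """Ratio of a group's passing rate to the reference group's passing rate."""
    return group_rate / reference_rate

# Passing rates of pseudo-applicants at the cut score (Table 5.5.5)
rates = {"Male": 0.26, "Female": 0.14, "White": 0.28, "Black": 0.03, "Hispanic": 0.12}
comparisons = [("Female", "Male"), ("Black", "White"), ("Hispanic", "White")]

relative = {
    group: round(relative_passing_rate(rates[group], rates[reference]), 2)
    for group, reference in comparisons
}
print(relative)
```

The three ratios round to .54, .11, and .43, matching the table.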


Table 5.5.6. Expected Performance by Validity and Selectivity

Selection                          Percent     # Tested    Percent High
Cut Point    Screen                Selected    per Hire    Performers
N/A          Current Workforce                             33.3
0.0          None                                          8.8
70.0         OPM (r=.30)           18.8        5.3         16.9
70.0         ATSAT (r=.76)         18.8        5.3         35.2
75.1         OPM (r=.30)           10.0        10.0        19.5
75.1         ATSAT (r=.76)         10.0        10.0        48.2
99.0         OPM (r=.30)           0.1         1376.4      37.0

Note.  Percent  High  Performers  =  the  percentage  of  applicants  selected  whose  expected  job  performance  is  in  the  top 
third  of  current  controllers. 

Table 5.6.1. Means for All Scales by Sample, Gender, and Race


Means for each        Controllers        Pseudo-Applicants


Scale 

All 

Male 

Female 

White 

Black  Hispanic 

All 

Male 

Female 

White 

Black  Hispanic 

Composite 

Criterion 

-0.050 

0.018 

-0.371 

0.061 

-0.755 

-0.299 

CBPM  Criterion 

189.49 

190.39 

185.23 

191.67 

174.86 

184.53 

Criterion  Ratings 

5.08 

5.14 

4.83 

5.12 

4.95 

5.01 

Composite  Predictor 

72.37 

72.90 

69.78 

73.58 

64.26 

70.10 

58.82 

60.69 

55.05 

60.90 

50.27 

57.37 

AM:  Applied  Math 

21.69 

22.17 

19.34 

22.22 

18.21 

20.03 

14.41 

15.61 

11.91 

15.26 

11.35 

13.60 

AN:  Angles 

27.07 

27.20 

26.48 

27.41 

24.73 

26.73 

22.91 

23.72 

21.28 

23.48 

20.48 

22.35 

AT:  Air  Traffic 
Scenarios 

Efficiency 

59.36 

60.20 

55.02 

60.05 

53.11 

61.84 

47.94 

50.32 

42.80 

50.34 

38.38 

48.06 

69.64 

69.58 

70.49 

69.74 

69.09 

69.64 

51.75 

51.75 

51.99 

53.65 

44.83 

52.85 

Procedural 

Accuracy 

Safety 

65.83 

66.16 

63.62 

66.21 

61.66 

67.56 

50.40 

51.25 

48.47 

51.45 

44.28 

52.92 

AY:  Analogies 

Info.  Processing: 
Latency 

0.229 

0.229 

0.226 

0.230 

0.213 

0.236 

0.244 

0.246 

0.240 

0.248 

0.236 

0.240 

Info.  Processing: 
Windows 

15.66 

15.58 

16.25 

16.08 

12.71 

15.22 

14.33 

14.32 

14.40 

15.17 

11.90 

11.77 

Reasoning 

27.94 

27.77 

28.73 

28.81 

21.87 

26.35 

22.14 

22.21 

22.09 

23.29 

18.40 

20.73 

DI:  Dials 

17.33 

17.46 

16.70 

17.46 

16.26 

17.30 

16.44 

16.81 

15.69 

16.77 

15.11 

15.92 

EQ:  Experiences 
Questionnaire 


Composure 

72.93 

73.13 

71.41 

72.80 

73.76 

72.06 

69.67 

70.17 

68.57 

70.19 

68.20 

69.02 

Concentration 

74.39 

74.32 

74.19 

74.52 

73.50 

72.84 

72.92 

72.97 

72.89 

73.45 

71.36 

70.75 

Behavioral 

74.75 

74.66 

75.12 

74.81 

76.21 

72.93 

73.68 

73.61 

74.08 

73.68 

73.49 

73.47 

Consistency 

Cooperation 

73.67 

73.34 

75.40 

73.51 

76.20 

73.67 

79.15 

77.94 

81.82 

79.16 

77.95 

79.63 

Decisiveness 

76.49 

76.33 

76.97 

76.83 

75.20 

73.77 

72.06 

72.40 

71.49 

72.90 

69.40 

69.88 

Execution 

75.27 

75.05 

76.35 

75.28 

74.91 

75.77 

75.80 

75.95 

75.49 

76.48 

72.14 

75.58 

Flexibility 

76.60 

76.57 

76.58 

76.48 

77.76 

76.24 

74.38 

74.21 

74.86 

75.09 

71.05 

73.50 

Tolerance  for  High 

66.50 

66.00 

68.50 

66.40 

65.03 

67.40 

68.48 

68.16 

69.20 

69.33 

66.02 

67.22 

Intensity 

Self  Awareness 

74.14 

73.91 

75.30 

74.27 

73.85 

73.90 

74.59 

74.09 

75.64 

75.04 

73.69 

71.81 

Self  Confidence 

81.42 

81.70 

79.63 

81.33 

81.84 

80.26 

77.20 

78.17 

75.16 

77.76 

75.29 

75.86 

Sustained  Attention 

71.65 

71.64 

71.23 

71.75 

71.93 

69.37 

73.40 

73.90 

72.56 

74.15 

71.20 

72.20 

Taking  Charge 

75.80 

75.59 

76.30 

75.85 

74.36 

75.02 

76.79 

76.52 

77.30 

77.83 

71.41 

75.24 

Interpersonal 

74.96 

74.69 

76.48 

74.52 

79.97 

75.03 

78.43 

77.75 

80.10 

78.79 

78.15 

77.36 

Tolerance 

Task  Closure 

74.20 

73.54 

77.23 

74.29 

74.22 

71.64 

74.33 

73.75 

75.79 

74.60 

71.93 

74.05 

LA:  Letter  Factory 

Situational  Awareness 

35.84 

35.80 

36.14 

36.55 

31.52 

34.57 

31.43 

31.68 

30.96 

32.82 

25.87 

30.10 

Planning  &  Thinking 

0.232 

0.231 

0.233 

0.239 

0.179 

0.219 

0.199 

0.199 

0.200 

0.210 

0.142 

0.199 

Ahead 

ME:  Memory 

16.89 

16.57 

18.54 

17.19 

15.52 

16.15 

14.59 

14.32 

15.24 

14.68 

13.65 

14.61 

MR:  Memory  Retest 

15.61 

15.27 

17.22 

15.92 

13.72 

14.61 

12.94 

12.45 

14.02 

13.24 

12.20 

12.59 

PL:  Planes 

Projection 

41.77 

41.78 

41.79 

41.94 

40.54 

41.42 

38.79 

38.97 

38.42 

39.30 

36.34 

38.22 

Visual/Spatial 

44.72 

44.54 

45.59 

44.93 

43.94 

43.08 

41.07 

40.49 

42.39 

41.58 

38.82 

40.39 

Timesharing 

103.66 

103.59 

103.94 

104.21 

101.26 

100.52 

98.92 

99.14 

98.46 

99.41 

98.28 

97.37 

SC:  Scan 

178.01 

177.26 

181.10 

179.50 

169.21 

177.05

164.85 

165.26 

164.32 

165.59 

153.78 

168.88 

SN:  Sound 

89.70 

90.17 

87.68 

89.25 

90.99 

93.20 

81.32 

82.89 

78.17 

83.61 

74.93 

80.43 

Means for each               Controllers                                          Pseudo-Applicants
Scale                        All     Male    Female  White   Black   Hispanic     All     Male    Female  White   Black   Hispanic
TW: Time Wall
  Perceptual Accuracy        92.40   92.44   92.37   92.76   90.40   91.13        86.43   86.31   86.88   87.12   81.28   87.01
  Perceptual Speed           51.24   51.05   52.25   51.63   49.11   51.54        51.31   51.71   50.64   52.06   48.19   49.44
  Time Estimation Accuracy   54.88   55.93   49.79   55.61   49.39   53.45        46.34   47.41   44.15   47.93   40.09   44.40

Table  5.6.2  Standard  Deviations  for  All  Scales  by  Sample,  Gender,  and  Race 


Standard  Deviations 
for  each  Scale 


Composite  Criterion 
CBPM  Criterion 
Criterion  Ratings 

Composite  Predictor 

AM:  Applied  Math 

AN:  Angles 

AT:  Air  Traffic 
Scenarios 
Efficiency 

Procedural 

Accuracy 

Safety 

AY:  Analogies 
Info.  Processing: 
Latency 

Info.  Processing: 
Windows 
Reasoning 

DI:  Dials 

EQ:  Experiences 
Questionnaire 
Composure 
Concentration 
Behavioral 
Consistency 
Cooperation 
Decisiveness 
Execution 
Flexibility 
Tolerance  for  High 
Intensity 
Self  Awareness 
Self  Confidence 
Sustained  Attention 
Taking  Charge 
Interpersonal 
Tolerance 
Task  Closure 

LA:  Letter  Factory 
Situational  Awareness 
Planning  &  Thinking 
Ahead 

ME:  Memory 


Controllers  Pseudo-Applicants


All 

Male 

Female 

White 

Black  Hispanic 

All 

Male 

Female 

White 

Black 

Hispanic 

0.825 

0.811 

0.809 

0.786 

0.759 

0.859 

14.87 

14.65 

15.07 

13.93 

14.12 

14.40 

0.717 

0.698 

0.717 

0.709 

0.649 

0.764 

7.91 

7.72 

8.39 

7.04 

10.12 

7.23 

12.59 

12.28 

12.08 

12.76 

11.00

10.78 

3.82 

3.39 

4.81 

3.26 

5.61 

4.60 

6.08 

6.05 

5.29 

6.25 

5.01 

5.58 

2.85 

2.78 

3.08 

2.43 

4.56 

2.48 

5.34 

4.94 

5.65 

5.22 

5.95 

5.34 

12.61 

12.53 

11.52 

12.18 

14.08 

11.20 

13.34 

13.65 

10.87 

13.27 

10.92 

12.49 

15.17 

14.83 

16.52 

14.99 

16.35 

15.67 

20.93 

20.58 

21.59 

20.34 

22.59 

21.40 

15.09 

15.13 

14.75 

14.84 

15.97 

15.08 

16.23 

16.45 

15.69 

15.95 

16.08 

17.45 

0.0441 

0.0454 

0.0378 

0.0432 

0.0459 

0.0519 

0.0422 

0.0426 

0.0415 

0.0399 

0.0474 

0.0482 

6.72 

6.80 

6.09 

6.65 

6.80 

6.06 

6.88 

6.83 

7.02 

6.85 

6.46 

6.56 

7.02 

7.11 

6.73 

6.66 

6.90 

7.37 

7.48 

7.64 

7.04 

7.70 

6.53 

7.01 

1.90 

1.75 

2.46 

1.83 

1.95 

2.09 

2.51 

2.27 

2.79 

2.31 

3.08 

2.69 

10.65 

10.67 

10.16 

10.70 

10.25 

10.51 

12.62 

12.49 

12.89 

12.58 

13.24 

12.27 

10.61 

10.58 

10.57 

10.45 

10.17 

12.56 

13.04 

13.46 

12.11 

13.07 

12.31 

10.65 

11.30 

11.41 

10.74 

11.23 

10.45 

13.43 

12.66 

12.78 

12.06 

12.66 

13.51 

11.74 

11.13 

11.01 

11.22 

11.13 

9.77 

12.19 

11.19 

11.26

10.60 

10.89 

13.13 

8.58 

10.85 

10.77 

11.32 

10.64 

10.97 

12.17 

13.37 

13.28 

13.49 

13.30 

13.90 

12.13 

9.84 

9.91 

9.60 

9.79 

10.13 

9.75 

11.43 

11.37

11.64 

11.08 

12.15 

10.55 

10.45 

10.30 

10.97 

10.45 

9.96 

10.07 

11.84

11.68 

12.17 

11.83 

12.91 

10.13 

10.80 

10.83 

10.14 

10.86 

9.85 

10.97 

11.77 

11.85 

11.63 

11.43 

13.20 

9.47 

10.09 

9.94 

10.85 

10.10 

9.18 

10.91 

10.68 

11.10 

9.71 

10.50 

11.85 

9.26 

10.97 

10.79 

11.72 

10.90 

11.41 

11.82 

13.08 

12.89 

13.30 

12.86 

13.76 

11.34

11.39 

11.62 

10.22 

11.33 

11.69 

11.30 

13.34 

13.21 

13.36 

13.02 

15.39 

11.34 

10.77 

10.72 

10.81 

10.78 

10.54 

10.71 

11.61 

11.59 

11.69 

11.04

12.47 

10.21 

12.26 

12.21 

12.27 

12.22 

9.93 

13.64 

11.63 

11.78 

10.89 

11.71 

12.01 

11.78 

10.84 

10.92 

10.06 

10.70 

11.22 

11.93 

13.00 

12.91 

12.88 

13.25 

13.04 

11.55 

7.48 

7.58 

6.89 

7.15 

7.91 

7.72 

8.94 

8.91 

9.00 

9.08 

6.84 

8.47 

0.0720 

0.0725 

0.0684 

0.0685 

0.0765 

0.0672 

0.0804 

0.0772 

0.0857 

0.0800 

0.0695 

0.0693 

5.17 

5.17 

4.73 

4.98 

5.43 

5.96 

5.71 

5.77 

5.50 

5.68 

6.29 

5.77 

Standard Deviations          Controllers                                          Pseudo-Applicants
for each Scale               All     Male    Female  White   Black   Hispanic     All     Male    Female  White   Black   Hispanic
MR: Memory Retest            5.41    5.43    5.10    5.27    5.61    6.24         5.92    5.92    5.75    5.91    6.24    6.12
PL: Planes
  Projection                 3.02    3.01    3.10    2.98    3.27    2.83         4.89    4.89    4.89    4.75    5.68    4.52
  Visual/Spatial             3.19    3.24    2.84    2.98    3.62    4.65         6.19    6.54    5.09    6.00    6.60    6.38
  Timesharing                9.72    9.88    9.19    9.54    10.21   11.25        10.86   11.10   10.37   10.99   10.37   10.84
SC: Scan                     24.76   25.56   21.37   22.59   28.85   29.95        31.70   29.47   35.92   32.78   35.25   26.38
SN: Sound                    21.98   21.81   21.40   21.95   22.22   19.85        20.18   20.35   19.15   19.51   20.35   20.07
TW: Time Wall
  Perceptual Accuracy        4.78    4.86    3.98    4.17    6.17    7.55         11.31   11.44   10.77   11.37   13.67   10.45
  Perceptual Speed           6.44    6.53    5.82    5.94    7.71    7.13         8.12    8.24    7.63    8.10    10.05   6.08
  Time Estimation Accuracy   9.86    9.74    8.80    9.29    11.64   10.32        11.80   12.08   10.94   11.66   10.91   13.09

Table  5.6.3  Sample  Sizes  for  All  Scales  by  Sample,  Gender,  and  Race 


Ns  for  each  Controllers  Pseudo-Applicants


Scale 

All 

Male 

Female 

White 

Black  Hispanic 

All 

Male 

Female 

White 

Black 

Hispanic 

Composite 

1043 

867 

171 

849 

92 

61 

0 

0 

0 

0 

0 

0 

Criterion 

CBPM  Criterion 

1046 

869 

172 

850 

94 

61 

0 

0 

0 

0 

0 

0 

Criterion  Ratings 

1227 

910 

176 

889 

96 

61 

0 

0 

0 

0 

0 

0 

Composite  Predictor 

1058 

866 

175 

851 

95 

60 

511 

348 

162 

339 

60 

51 

AM:  Applied  Math 

1060 

868 

175 

853 

95 

60 

519 

353 

165 

344 

62 

52 

AN:  Angles 

1059 

867 

175 

852 

95 

60 

518 

353 

164 

343 

62 

52 

AT:  Air  Traffic 
Scenarios 
Efficiency 

Procedural 

Accuracy 

Safety 

AY:  Analogies 
Info.  Processing: 
Latency 

Info.  Processing: 
Windows 
Reasoning 

DI:  Dials 


1012 

1012 

831 

831 

164 

164 

811 

811 

90 

90 

60 

60 

1012 

831 

164 

811 

90 

60 

1059 

867 

175 

852 

95 

60 

1059 

867 

175 

852 

95 

60 

1059 

867 

175 

852 

95 

60 

1062 

869 

175 

853 

96 

60 

498 

498 

343 

343 

154 

154 

331 

331 

57 

57 

50 

50 

498 

343 

154 

331 

57 

50 

512 

348 

163 

339 

60 

52 

512 

348 

163 

339 

60 

52 

512 

348 

163 

339 

60 

52 

518 

352 

164 

342 

62 

52 

EQ:  Experiences 
Questionnaire 


Composure 

1050 

860 

174 

848 

91 

60 

508 

345 

162 

339 

58 

51 

Concentration 

1048 

859 

173 

847 

90 

60 

507 

345 

161 

338 

58 

51 

Behavioral 

Consistency 

1049 

859 

174 

847 

91 

60 

504 

342 

161 

338 

57 

49 

Cooperation 

1047 

858 

173 

847 

90 

60 

504 

343 

160 

338 

57 

49 

Decisiveness 

1050 

860 

174 

848 

91 

60 

507 

344 

162 

339 

58 

51 

Execution 

1058 

867 

175 

852 

95 

60 

516 

351 

164 

342 

62 

51 

Flexibility 

1048 

859 

173 

847 

90 

60 

504 

342 

161 

338 

58 

49 

Tolerance  for  High 
Intensity 

1058 

867 

175 

852 

95 

60 

515 

350 

164 

342 

61 

51 

Self  Awareness 

1058 

867 

175 

852 

95 

60 

513 

348 

164 

342 

59 

51 

Self  Confidence 

1056 

865 

175 

851 

94 

60 

511 

348 

162 

341 

58 

51 

Sustained  Attention 

1058 

867 

175 

852 

95 

60 

513 

348 

164 

342 

59 

51 

Taking  Charge 

1049 

859 

174 

847 

91 

60 

508 

345 

162 

339 

58 

51 

Interpersonal 

1050 

860 

174 

848 

91 

60 

508 

345 

162 

339 

58 

51 

Tolerance 

Task  Closure 

1047 

857 

174 

846 

91 

60 

499 

340 

158 

336 

56 

49 

LA:  Letter  Factory 

Situational  Awareness 

1059 

866 

175 

851 

95 

60 

516 

350 

164 

344 

60 

51 

Planning  &  Thinking 

1059 

866 

175 

851 

95 

60 

516 

350 

164 

344 

60 

51 

Ahead 

ME:  Memory 

1057 

865 

175 

850 

95 

60 

517 

352 

164 

343 

62 

51 

MR:  Memory  Retest

1049 

859 

175 

847 

93 

59 

512 

348 

163 

340 

61 

51 

PL:  Planes 


Projection 

1053 

863 

175 

849 

94 

60 

512 

349 

162 

339 

61 

51 

Visual/Spatial 

1053 

863 

175 

849 

94 

60 

512 

349 

162 

339 

61 

51 

Timesharing 

1053 

863 

175 

849 

94 

60 

512 

349 

162 

339 

61 

51 

SC:  Scan 

1030 

841 

172 

834 

85 

59 

495 

337 

157 

332 

55 

48 

SN:  Sound 

1055 

862 

174 

845 

96 

60 

505 

346 

157 

337 

59 

51 

TW:  Time  Wall 


Perceptual  Accuracy 

1038 

847 

175 

833 

94 

60 

480 

323 

156 

319 

56 

47 

Perceptual  Speed 

1038 

847 

175 

833 

94 

60 

480 

323 

156 

319 

56 

47 

Time  Estimation 

1039 

848 

175 

833 

95 

60 

484 

325 

158 

321 

57 

48 

Accuracy 


Notes.  The  N  for  the  composite  predictor  is  greater  than  the  Ns  for  most  of  the  predictors  because  missing  predictor 
values  were  estimated  when  computing  the  composite  predictor.  Predictors  in  boldface  were  used  in  the  final  AT- 
SAT  battery. 

Table 5.6.4. Frequency Table for Chi-Square Test of Association for Predictor Composite

Observed Frequencies
Group                  Fail    Pass    Total for Group
Males                  248     100     348
Females                137     25      162
Total for Fail/Pass    385     125     510

Expected Frequencies
Group                  Fail    Pass
Males                  263     85
Females                122     40

χ² = 10.574
df = 1
p = .0011
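
The chi-square statistic in Table 5.6.4 can be traced from the observed frequencies alone. The sketch below assumes a Pearson chi-square test without a continuity correction (the table does not say which variant was used) and gets the p-value from the identity that, for one degree of freedom, the upper-tail probability equals erfc(sqrt(x/2)).

```python
import math

# Observed frequencies from Table 5.6.4: rows are groups, columns are fail/pass
observed = [
    [248, 100],  # Males
    [137, 25],   # Females
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

chi_square = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)

# For df = 1, the chi-square upper-tail probability is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print([[round(e) for e in row] for row in expected])
print(round(chi_square, 3), round(p_value, 4))
```

Rounded to whole numbers, the expected counts are 263, 85, 122, and 40, and the statistic and p-value agree with the 10.574 and .0011 reported in the table.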

Table 5.6.5. Group Differences in Means and Passing Rates for the Pseudo-Applicants

                                  Passing Rate                               Difference in Means (in Std. Dev. Units)
Predictor                         Male  Female    White  Black     Hispanic  Female     Black      Hispanic
Composite Predictor               .28   .15 a**   .30    .03 a***  .14 a*    -.44 ***   -.74 ***   -.24
AM: Applied Math                  .28   .07 a***  .27    .05 a***  .12 a*    -.61 ***   -.63 ***   -.27
AN: Angles                        .46   .27 a***  .44    .23 a**   .35 a     -.49 ***   -.57 ***   -.22
AT: Air Traffic Scenarios
  Efficiency                      .41   .19 a***  .40    .11 a***  .40       -.55 ***   -.90 ***   -.17
  Procedural Accuracy             .33   .37       .38    .26 *     .36       .01        -.43 **    -.04
  Safety                          .34   .24 a*    .32    .21 *     .34       -.17       -.45 **    .09
AY: Analogies
  Info. Processing: Latency       .83   .82       .85    .75       .79       -.14       -.29 *     -.20
  Info. Processing: Windows       .64   .61       .67    .52 a*    .50 a*    .01        -.48 ***   -.50 ***
  Reasoning                       .33   .34       .40    .15 a***  .29 *     -.01       -.63 ***   -.34 *
DI: Dials                         .76   .60 a***  .75    .55 a**   .65       -.49 ***   -.72       -.37 *
EQ: Experiences Questionnaire
  Composure                       .57   .51       .56    .50       .55       -.13       -.16       -.09
  Concentration                   .59   .63       .63    .53       .55       -.01       -.16       -.21
  Behavioral Consistency          .61   .65       .63    .58       .65       .04        -.02       -.02
  Cooperation                     .82   .93       .85    .84       .96       .34        -.11       .04
  Decisiveness                    .54   .51       .57    .47       .45 *     -.07       -.26       -.23
  Execution                       .73   .69       .73    .58 a*    .75       -.04       -.39 **    -.08
  Flexibility                     .61   .60       .63    .52       .57       .06        -.34 *     -.13
  Tolerance for High Intensity    .79   .80       .82    .70 *     .82       .09        -.29 *     -.19
  Self Awareness                  .66   .76       .70    .73       .57       .14        -.13       -.31 *
  Self Confidence                 .58   .47 *     .56    .55       .45       -.23 *     -.19       -.15
  Sustained Attention             .74   .71       .75    .68       .75       -.10       -.23       -.15
  Taking Charge                   .74   .77       .78    .59 a**   .78       .07        -.58 ***   -.24
  Interpersonal Tolerance         .76   .84       .80    .76       .76       .20        -.06       -.12
  Task Closure                    .61   .69       .65    .59       .61       .16        -.20       -.04
LA: Letter Factory
  Situational Awareness           .53   .51       .58    .17 a**   .53       -.08       -.77       -.30 *
  Planning & Thinking Ahead       .49   .51       .55    .25 a**   .43 a     .02        -.86 ***   -.15
ME: Memory                        .56   .62       .58    .55       .65       .16        -.18       -.01
MR: Memory Retest                 .50   .59       .55    .49       .51       .27        -.18       -.11
PL: Planes
  Projection                      .57   .48       .58    .34 a***  .45 a     -.11       -.62 ***   -.23
  Visual/Spatial                  .43   .51       .48    .30 a**   .39       .29        -.46 **    -.20
  Timesharing                     .48   .49       .51    .39 a     .47       -.06       -.10       -.19
SC: Scan                          .36   .47       .43    .20 a**   .42       -.03       -.36 *     .10
SN: Sound                         .54   .42 a*    .53    .37 a*    .51       -.23 *     -.44 **    -.16
TW: Time Wall
  Perceptual Accuracy             .46   .40       .49    .25 a***  .43       .05        -.51 ***   -.01
  Perceptual Speed                .67   .63       .69    .50 a**   .53 a*    -.13       -.48 **    -.32 *
  Time Estimation Accuracy        .44   .27 a***  .45    .14 a***  .31 a     -.27 **    -.67 ***   -.30

For  passing  rates: 

a  The  passing  rate  for  this  group  is  less  than  80%  of  the  passing  rate  for  the  reference  group. 

For t-test of the difference between the mean scores and for χ² test of the difference between passing rates for the
minority group vs. the reference group:

* p < .05
** p < .01
*** p < .001

Notes. Ns range from 288-353 for males, 140-165 for females, 289-342 for whites, 45-62 for blacks, and 41-52 for
Hispanics. Each value in the three columns on the right, labeled Difference in Means, represents the difference
between the mean score for the minority group (i.e., Female, Black, Hispanic) and the mean score for the reference
group (i.e., Male, White). This difference is expressed in standard deviation units based on the reference group. This
is often referred to as a d-score. A negative value indicates that the minority group's mean is less than the reference
group's mean. The scores used in the final battery are boldfaced. Significant differences in means are asterisked only
where the difference favors the reference group.
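
The d-score and the 80% passing-rate flag described in these notes can be illustrated with the Applied Math (AM) values for pseudo-applicants. In the sketch below the means come from Table 5.6.1, the reference-group standard deviations from Table 5.6.2, and the passing rates from Table 5.6.5. That the report computed its d-scores from exactly these figures is an assumption, and the function names are hypothetical.

```python
def d_score(minority_mean, reference_mean, reference_sd):
    """Mean difference expressed in reference-group standard deviation units."""
    return (minority_mean - reference_mean) / reference_sd

def below_four_fifths(minority_rate, reference_rate):
    """True when the minority passing rate is under 80% of the reference rate."""
    return minority_rate < 0.8 * reference_rate

# AM: Applied Math, pseudo-applicant sample
d_female = d_score(11.91, 15.61, 6.05)    # Female vs. Male (Male SD = 6.05)
d_black = d_score(11.35, 15.26, 6.25)     # Black vs. White (White SD = 6.25)
d_hispanic = d_score(13.60, 15.26, 6.25)  # Hispanic vs. White
print(round(d_female, 2), round(d_black, 2), round(d_hispanic, 2))

# AM passing rates: Female .07 vs. Male .28, Hispanic .12 vs. White .27
print(below_four_fifths(0.07, 0.28), below_four_fifths(0.12, 0.27))
```

For this scale the three d-scores round to -.61, -.63, and -.27, in line with the table, and both passing-rate comparisons fall below the 80% threshold marked by footnote a.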

Table 5.6.6. Fairness Analysis Results

Standardized Slope of Regression Line (i.e., validity coefficient)

Regression Lines' Difference at Cut Score (in Std. Dev. Units)


Predictor 

Male 

Fem 

White 

Black 

Hisp 

Fem 

Black 

Hisp 

Predictor  Composite 

.50 

.47 

.44 

.50 

.46 

-.34 

-.59 

-.28 

AM:  Applied  Math 

.38 

.43 

.34 

.43 

.38 

-.25 

-.74 

-.27 

AN:  Angles 

.31 

.35 

.27 

.34 

.25 

-.46 

-.82 

-.41 

AT:  Air  Traffic  Scenarios 

Efficiency 

.30 

.24 

.25 

.46 

.30 

-.40 

-.96 

-.64 

Procedural  Accuracy 

.14 

.12 

.14 

.22 

.01 

-.51 

-1.11 

-.44 

Safety 

.25 

.09 

.21 

.26 

.25 

-.43 

-1.04 

-.56 

AY:  Analogies 

Latency  Info.  Proc. 

.00 

.04 

-.01 

.05 

-.17 

-.55 

-1.05 

-.36 

Windows  Info.  Proc. 

.20 

.20 

.17 

.23 

-.04 

-.58 

-.97 

-.35 

Reasoning 

.38 

.34 

.28 

.36 

.39 

-.67 

-.75 

-.41 

DI:  Dials 

.23 

.20 

.21 

.16 

.12 

-.39 

-.91 

-.39 

EQ:  Experiences 

Questionnaire 

Composure 

.13 

.09 

.14 

.08 

.08 

-.50 

-.97 

-.44 

Concentration 

.11 

-.01 

.07 

.09 

.01 

-.47 

-.97 

-.44 

Behavioral  Consistency 

.13 

.10 

.14 

.02 

.14 

-.53 

-.95 

-.45 

Cooperation 

.01 

-.08 

-.02 

-.02 

.17 

-.47 

-.96 

-.57 

Decisiveness 

.10 

-.03 

.05 

.15 

.01 

-.47 

-.99 

-.45 

Execution 

.07 

.08 

.07 

.04 

.00 

-.55 

-1.02 

-.43 

Flexibility 

.05 

.04 

.06 

-.01 

.10 

-.51 

-.94 

-.49 

Tolerance  for  High  Intensity 

.02 

-.02 

-.01 

-.01 

.01 

-.50 

-1.04 

-.49 

Self  Awareness 

.10 

-.07  * 

.06 

.05 

.04 

-.44 

-1.03 

-.46 

Self  Confidence 

.06 

-.04 

.08 

-.01 

-.03 

-.49 

-.99 

-.42 

Sustained  Attention 

.08 

.03 

.06 

.08 

.07 

-.50 

-1.05 

-.46 

Taking  Charge 

.02 

-.07 

-.02 

.07 

-.07 

-.47 

-1.01 

-.44 

Interpersonal  Tolerance 

.08 

-.02 

.09 

-.03 

.13 

-.48 

-.91 

-.50 

Task  Closure 

.07 

-.00 

.03 

.11 

.13 

-.49 

-1.01 

-.49 

LA:  Letter  Factory 

Situational  Awareness 

.31 

.30 

.24 

.18 

.25 

-.59 

-.89 

-.42 

Planning  &  Thinking  Ahead 

.32 

.20 

.22 

.37 

.24 

-.51 

-.82 

-.43 

ME:  Memory 

.24 

.16 

.14 

.31 

.27 

-.60 

-1.06 

-.50 

MR:  Memory  Retest 

.27 

.23 

.19 

.37 

.20 

-.67 

-1.03 

-.42 

PL:  Planes 

Projection 

.20 

.13 

.15 

.13 

.09 

-.50 

-.97 

-.43 

Visual/Spatial 

.17 

.12 

.10 

.16 

.18 

-.58 

-1.00 

-.40 

Timesharing 

.21 

.07 

.15 

.22 

.17 

-.50 

-1.00 

-.42 

SC:  Scan 

.22 

.08 

.18 

.18 

.00 

-.59 

-.91 

-.45 

SN:  Sound 

.16 

.02 

.15 

.31 

.15 

-.46 

-1.13 

-.52 

TW:  Time  Wall 

Perceptual  Accuracy 

.22 

.14 

.14 

.35 

.21 

-.53 

-.96 

-.42 

Perceptual  Speed 

.11 

.04 

.04 

.13 

-.15 

-.52 

-1.03 

-.38 

Time  Estimation  Accuracy 

.20 

.09 

.20 

.12 

.10 

-.41 

-.94 

-.41 

*  p  <  .05 

Notes. Ns range from 823-859 for males, 159-170 for females, 803-844 for Whites, 80-90 for Blacks, and 60 for
Hispanics. There were no significant differences in slopes or intercepts that favored the reference group. Each value
in the three columns on the right, labeled Regression Lines' Difference at Cut Score (in Std. Dev. Units), represents
how far the regression line for the minority group (i.e., Female, Black, Hispanic) is above the regression line for the
reference group (i.e., Male, White) at the predictor's cut score. This distance is expressed in standard deviation units
based on the regression line for the reference group (i.e., the standard error of estimate of the Male or White
regression line). A negative value indicates that the reference group's regression line is above the minority group's
regression line at the cut point. The scores in the final battery are boldfaced.
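
The quantity described in the notes above can be sketched in a few lines of Python. This is an illustrative sketch only, not the report's analysis code: the function names and the toy data are hypothetical, and ordinary least squares with n - 2 degrees of freedom for the standard error of estimate is an assumption.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, standard error of estimate).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    see = sqrt(sse / (n - 2))  # residual SD with n - 2 degrees of freedom
    return slope, intercept, see

def difference_at_cut(ref_x, ref_y, min_x, min_y, cut_score):
    """Minority-group line minus reference-group line at the cut score,
    in units of the reference group's standard error of estimate."""
    ref_slope, ref_intercept, ref_see = fit_line(ref_x, ref_y)
    min_slope, min_intercept, _ = fit_line(min_x, min_y)
    ref_pred = ref_intercept + ref_slope * cut_score
    min_pred = min_intercept + min_slope * cut_score
    return (min_pred - ref_pred) / ref_see

# Hypothetical toy data (not taken from the report).
ref_x, ref_y = [0, 1, 2, 3], [0.5, 0.5, 2.5, 2.5]
min_x, min_y = [0, 1, 2, 3], [0.0, 0.0, 2.0, 2.0]
gap = difference_at_cut(ref_x, ref_y, min_x, min_y, cut_score=1.5)
```

A negative `gap`, as in this toy example, corresponds to the case in which the reference group's line lies above the minority group's line at the cut score.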
Table 5.6.7. Criterion d-Scores Analyses for Controllers

                          Proportion of Controllers in Each Group               Difference in Means
                          Above 32nd Percentile in Total Sample                 (in Std. Dev. Units)
Predictor                 Male   Female     White   Black      Hispanic         Female     Black       Hispanic
Predictor Composite       .70    .56 ***    .74     .31 a***   .57 a**          -.40 ***   -1.35 ***   -.50 ***
Composite Criterion (of   .71    .55 a***   .74     .36 a***   .50 a***         -.47 ***   -1.04 ***   -.47 ***
Ratings and CBPM)

Number of Controllers in Each Group
Male   Female   White   Black   Hispanic
857    170      842     90      60

The following significance tests were performed to compare the minority group vs. the reference group: (a) t-test of
the difference between the mean scores and (b) χ² test of the difference between the passing rates.

*p  <.05 

**  p  <  .01 
***  p  <  .001 

a  The  passing  rate  of  this  group  (i.e.,  the  proportion  above  the  hypothetical  cut  score)  is  less  than  80%  of  the 
passing  rate  of  the  reference  group.  (The  hypothetical  cut  score  is  the  score  at  the  32nd  percentile  of  the  combined 
controller  sample.) 

Notes. Participants missing either the composite criterion or predictor composite scores were excluded from the
analysis. The following significance tests were performed to compare the minority group vs. the reference group: (a)
t-test of the difference between the mean scores and (b) χ² test of the difference between the passing rates. Each
value in the three columns on the right, labeled Difference in Means, represents the difference between the mean
score for the minority group (i.e., Female, Black, Hispanic) and the mean score for the reference group (i.e., Male,
White). This difference is expressed in standard deviation units based on the reference group. This is often referred
to as a d-score. A negative value indicates that the focal group's mean is less than the reference group's mean.
Significant differences in means are asterisked only where the difference favors the reference group.
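
The two statistics defined in the notes, the d-score and the 80% passing-rate comparison flagged by footnote a, can be illustrated with a short Python sketch. The function names are hypothetical, and the means used in the d-score example are invented for illustration. The passing rates are the Black and White values from the Predictor Composite row of Table 5.6.7.

```python
def d_score(focal_mean, reference_mean, reference_sd):
    """Mean difference in reference-group standard deviation units."""
    return (focal_mean - reference_mean) / reference_sd

def passing_rate_ratio(focal_rate, reference_rate):
    """Focal-group passing rate as a fraction of the reference-group rate."""
    return focal_rate / reference_rate

def below_80_percent(focal_rate, reference_rate, threshold=0.80):
    """True when the focal rate is less than 80% of the reference rate."""
    return passing_rate_ratio(focal_rate, reference_rate) < threshold

# Passing rates from the Predictor Composite row (Black .31, White .74).
black_vs_white = passing_rate_ratio(0.31, 0.74)
flag_black = below_80_percent(0.31, 0.74)

# Hypothetical means and SD, for illustration only.
example_d = d_score(focal_mean=48.0, reference_mean=60.0, reference_sd=10.0)
```

With these inputs the ratio is about .42, well under the .80 threshold, which matches the footnote a marker on that cell.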
Table 5.6.8. Power Analysis of Fairness Regressions

Statistic                                   Males   Females   Whites   Blacks   Hispanics
Slope                                       .052    .045      .049     .037     .055
Smallest detectable slope difference                .18                .020     .036
at 80% power, p < .05
Intercept at Cut Score                      -.13    -.35      -.11     -.53     -.31
Smallest detectable intercept difference            -.15               -.21     -.26
at 80% power, p < .05

Notes.  The  criterion  and  predictor  are  each  scaled  to  have  a  standard  deviation  of  one  and  a  mean  of  zero  for  the 
sample  of  all  controllers.  Smallest  detectable  difference  at  80%  power  is  the  minimum  difference  in  the  slope  or 
intercept  between  the  minority  group  and  its  reference  group  in  the  population  to  find  statistical  significance  in  80% 
of  the  samples. 
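
As a rough illustration of the "smallest detectable difference at 80% power" idea, the sketch below uses the common normal approximation for a two-sided test. The report does not state how its values were computed, so the formula, the function names, and the standard errors in the example are all assumptions.

```python
from math import sqrt
from statistics import NormalDist

def se_of_difference(se_minority, se_reference):
    """Standard error of the difference between two independent estimates."""
    return sqrt(se_minority ** 2 + se_reference ** 2)

def smallest_detectable_difference(se_diff, power=0.80, alpha=0.05):
    """Normal-approximation minimum population difference that a two-sided
    test at level alpha detects with the requested power."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se_diff

# Hypothetical standard errors for a minority-group and reference-group slope.
se_diff = se_of_difference(0.012, 0.005)
mdd = smallest_detectable_difference(se_diff)
```

Under this approximation the detectable difference is about 2.8 standard errors of the difference, so smaller minority-group samples (larger standard errors) raise the threshold.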


Table 5.6.9. Potential Impact of Targeted Recruitment

             Recruiting Strategy          AT-SAT            % At or     % At or
Group        Range      Relative Freq.    Mean     S.D.     Above 70    Above 75
All          All        1                 58.8     12.6     18.8        8.9
Hispanics    All        1                 57.5     10.8     12.2        5.1
Blacks       All        1                 50.4     11.1     3.9         1.3
             Top 10%    6                 56.9     13.2     15.5        5.1
             Top 5%     5                 54.3     13.3     16.2        5.3

Notes.  Range  =  the  portion  of  the  potential  applicant  population  that  are  targeted  for  preferential  recruitment  efforts. 
Relative  Freq.  =  the  number  of  people  recruited  from  the  targeted  range  under  targeted  vs.  untargeted  recruitment. 
Table 6.1. Correlations Between Archival and AT-SAT Criterion Measures (N=669)


DayIX

HrsIX 

IPIX 

DayXII 

HrsXII 

IPXII 

TFPL 

Rating 

CBPM 

Archival  criterion 

measures 

Days  in  Phase  IX 
(DayIX)

.69**

-.10*

.30**

.33**

.02

.45**

-.07 

.02 

OJT  Hours  Phase  IX 
(HrsIX) 

-.07 

.33** 

.52** 

.06 

.41** 

m 

-.04 

Indication  of 
Performance  Phase 

IX  (IPIX) 

-.09* 

-.05 

.44** 

.07 

-.15** 

.03 

Days  in  Phase  XII 
(DayXII) 

.61** 

-.03 

.37** 

-.19** 

11** 

OJT  Hours  Phase 

XII  (HrsXII) 

-.05 

.36** 

-.14** 

Indication  of 
Performance  Phase 
XII  (IPXII) 

.10* 

Time  to  FPL  (TFPL) 

-.16** 

-.03 

AT-SAT  criterion 

measures 

Rating  Composite 
(Rating) 

.22**

Final  CBPM  score 
(CBPM) 

*  Significantly  different  from  0  at  p  <  .05. 


**  Significantly  different  from  0  at  p  <  .01. 
Table 6.2. Correlations of Archival Selection Procedures with Archival and AT-SAT Criterion
Measures (Correlations adjusted for restriction in range of the predictors are in parentheses
following the restricted correlations. N=370)

                                               OPM Rating     Final Nonradar   Final Radar
                                                              Score            Score
Archival selection procedures
  OPM Rating                                                  .18** (.36)      .11** (.18)
  Final score in Nonradar Screen Program                                       .37** (.63)
  Final score in Radar Training program
Archival criterion measures
  Days in Phase IX (DayIX)                     .03 (.06)      -.09 (-.19)      -.20** (-.32)
  OJT Hours Phase IX (HrsIX)                   .07 (.11)      -.12* (-.25)     -.18** (-.28)
  Indication of Performance Phase IX (IPIX)    .09* (.14)     .10 (.20)        -.01 (-.01)
  Days in Phase XII (DayXII)                   -.02 (-.03)    -.09 (-.18)      -.21** (-.34)
  OJT Hours Phase XII (HrsXII)                 .03 (.05)      -.13* (-.25)     -.22** (-.34)
  Indication of Performance Phase XII (IPXII)  .12* (.20)     .13* (.26)       .11* (.18)
  Time to FPL (TFPL)                           -.08 (-.12)    -.17** (-.34)    -.22** (-.35)
AT-SAT criterion measures
  Rating Composite (Rating)                    .02 (.04)      .12* (.24)       .17** (.27)
  Final CBPM score (CBPM)                      .22** (.34)    .34** (.60)      .21** (.32)

* Significantly different from 0 at p < .05.
** Significantly different from 0 at p < .01.
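
The parenthesized values in Table 6.2 are correlations adjusted for restriction in range of the predictor. The table does not say which correction formula was applied. The sketch below assumes the widely used Thorndike Case II correction for direct restriction on the predictor; the function name and the SD ratio are hypothetical.

```python
from math import sqrt

def correct_for_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction (assumed, not confirmed by the source).

    sd_ratio is the unrestricted predictor SD divided by the restricted SD.
    """
    r = r_restricted
    u = sd_ratio
    return (r * u) / sqrt(1 - r ** 2 + (r ** 2) * (u ** 2))

# Illustrative SD ratio, back-solved for this example rather than reported.
corrected = correct_for_range_restriction(0.34, sd_ratio=2.07)
```

With an SD ratio near 2.07, a restricted correlation of .34 corrects to roughly .60, in line with the CBPM entry in the Final Nonradar Score column. That agreement is suggestive only and does not establish that this formula or ratio was used.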
Table 6.3. Correlations of Archival Selection Procedure Components with Archival and AT-SAT Criterion
Measures  (N=212) 


MCAT 

ABSR 

OKT 

AvIA 

AvTA 

NCST 

RavIA 

RavTA 

RCST 

Archival  selection  test 
components 

Multiplex  Controller 
Aptitude  Test 
(MCAT) 

.24** 

.04 

.29** 

.20** 

.17* 

.25** 

.22** 

.02 

Abstract  Reasoning 
(ABSR) 

-.12 

.12 

.16* 

.19** 

-.04 

-.02 

.08 

Occupational 

Knowledge  Test 
(OKT) 

.20**

.12 

i 

.09 

.04 

1 

.04 

i 

.05 

Average  Instructor 
Assessment  (AvIA) 

m 

.23** 

.37** 

.18** 

Average  Technical 
Assessment  (AvTA) 

.23**

.32** 

.25** 

Nonradar  Controller 
Skills  Test  (NCST) 

.16* 

.17* 

.25** 

Radar  Instructor 
Assessment  (RAvIA) 

.83** 

.05 

Radar  Technical 
Assessment  (RAvTA) 

.09 

Radar  Controller  Skills 
Test  (RCST) 

Archival  criterion 

measures 

Days  in  Phase  IX 
(DayIX)

-.01 

.03 

-.01 

-.12 

-.16** 

-.08 

-.19* 

-.18* 

-.16* 

OJT  Hours  Phase  IX 
(HrsIX) 

.03 

.05 

-.03 

-.09 

-.17** 

-.10 

-.14* 

-.16* 

-.10

Indication  of 
Performance  Phase 

IX  (IPIX) 

.00 

.07 

-.01 

.07 

.00 

.01 

.12 

.06 

-.03 

Days  in  Phase  XII 
(DayXII) 

.13 

.06 

.12 

-.04 

-.11 

-.02 

-.11 

-.18* 

-.21** 

OJT  Hours  Phase  XII 
(HrsXII) 

.09 

.12 

-.01 

-.02 

-.09 

-.07 

-.12 

-.21  ** 

-.11 

Indication  of 
Performance  Phase 

XII (IPXII)

.12 

.14 

.12 

.05 

.02 

.10 

.10 

.16* 

.14* 

Time  to  FPL  (TFPL) 

-.06 

.06 

-.02 

-.05 

-.01 

AT -SAT  criterion 

measures 

Rating  Composite 
(Rating) 

.01 

-.06 

.06 

.17* 

.09 

.05 

.13 

.13 

-.02 

Final  CBPM  score 
(CBPM) 

.21**

.13 

.15* 

.29** 

.35** 

.32** 

.17* 

.15* 

.31** 

*  Significantly  different  from  0  at  p  <  .05. 
** Significantly different from 0 at p < .01.
Table 6.4. Correlations of Criterion Measures from High Fidelity Simulation with Archival
Performance-Based Predictors and Criterion Measures.


MSep 

MFlow 

A-SA 

Comm 

Coord 

MTask 

SectWk 

Hifirate 

OES7 

High  fidelity  simulation  ratings 

Maintain  separation  (MSep) 

.83** 

.86** 

.82** 

.83** 

.84** 

.89** 

-.32** 

107 

I 

107 

107 

107 

107 

103 

Maintain  efficient  AT  flow 

.92**

.92**

g9** 

.95** 

.95** 

.95** 

-.20* 

(MFlow) 

107 

107 

107 

107 

107 

107 

103 

Attention,  Situation  Awareness 

.90**

.88** 

.92** 

.93**

.95**

-.23* 

(A-SA) 

107 

107 

107 

107 

107 

103 

Communications  (Comm) 

.88** 

.94**

.94**

ESI 

-.24* 

107 

107 

107 

■  :  . 

103 

Coordination  (Coord) 

.92**

.91**

.92**

-.20* 

107 

107 

107 

103 

Multiple  tasks  (Mtask) 

.92**

.92**

-.23* 

107 

107 

103 

Managing  sector  workload 

.92**

-.24* 

(SectWk) 

107 

103 

Overall rating (Hifirate)

-.25* 

103 

Number  of  operational  errors  in 
scenario  7  (OES7) 

AT-SAT criterion measures

Rating  Composite  (Rating) 

.34** 

.40** 

.34** 

.37** 

.42** 

.41**

.42** 

.38** 

.09 

62 

62 

62 

62 

62 

62 

62 

62 

59 

Final  CBPM  score  (CBPM) 

.57** 

.64** 

.60** 

.64** 

.65** 

.65** 

.68** 

.63** 

-.05 

62 

62 

62 

62 

62 

62 

62 

62 

59 

Archival  criterion  measures 

i 

OJT  Hours  Phase  IX  (HrsIX) 

-.14 

-.34* 

-.26 

-.27 

-.27 

-.33* 

-.32* 

-.30* 

53 

53 

53 

53 

53 

53 

53 

53 

52 

Indication  of  Performance  Phase 

.08 

-.02 

.02 

.03 

-.01 

-.02 

.01 

.00 

IX  (IPIX) 

56 

56 

56 

56 

56 

56 

56 

56 

55 

OJT  Hours  Phase  XII  (HrsXII) 

!  -.24 

|  -.35* 

■■ 

-.22 

mm 

-.29* 

-.35* 

.08 

51 

51 

HH 

mm 

51 

mm 

51 

51 

50 

Ind.  of  Performance  Phase  XII 

-.05 

-.04 

-.01 

.01 

-.09 

-.11 

-.04 

-.05 

-.03 

(IPXII) 

56 

56 

56 

56 

56 

56 

56 

56 

55 

Time  to  FPL  (TFPL) 

-.11 

-.30* 

-.26* 

-.21 

-.18 

-.25 

-.25 

-.23 

.23 

45 

45 

45 

45 

45 

45 

45 

45 

44 

Archival  performance-based 
selection  test  components 

Nonradar  Average  Instructor 

.38** 

.37** 

.43** 

.36** 

.40** 

.36** 

.35** 

.41**

-.23 

Assessment  (AvIA) 

55 

55 

55 

55 

55 

55 

55 

55 

53 

Nonradar  Average  Technical 

.53** 

.50** 

.57** 

.48** 

.51** 

.53** 

.49**

.54** 

-.21 

Assessment  (AvTA) 

55 

55 

55 

55 

55 

55 

55 

55 

53 

Nonradar  Controller  Skills  Test 

.21 

.24 

.24 

.18 

.27* 

.24 

.23* 

.26 

.09 

(NCST) 

55 

55 

55 

55 

55 

55 

55 

55 

53 

Radar  Average  Instructor 

-.18 

-.08 

-.07 

-.12 

-.15 

-.03 

-.11 

-.10 

-.35 

Assessment  (RIA) 

30 

30 

30 

30 

30 

30 

30 

30 

29 

Radar  Average  Technical 

.51** 

.57** 

.55** 

.60** 

.55** 

.65** 

.58** 

WEtm 

-.12 

Assessment  (RTA) 

30 

30 

30 

30 

30 

30 

30 

30 

29 

Radar  Controller  Skills  Test 

.71** 

.60** 

.67** 

.61** 

.66** 

.62** 

.69** 

.67** 

-.01 

(RCST) 

21 

21 

21 

21 

21 

21 

21 

21 

21 

Table 6.5. Correlations Between OPM Selection Tests and AT-SAT Predictor Tests (N=561).

                                               MCAT           Abstract        OKT
AT-SAT Predictor Tests                                        Reasoning
Applied Math: N items correct                  .15** (.21)    .18** (.21)     -.06
Angles: N items correct                        .09* (.13)     .23** (.27)     .04
Dials: N items correct                         .11** (.15)    .13** (.16)     -.03
Memory: N items correct                        .10* (.14)     .12** (.14)     -.11*
Memory Recall: N items correct                 .10* (.14)     .14** (.17)     -.12**
Digit Span: N items correct                    .10* (.14)     .04 (.05)       -.05
Time Wall: Time Estimation Accuracy            .13** (.18)    .16** (.19)     -.08
Time Wall: Perceptual Accuracy                 .09* (.13)     .13** (.16)     -.11**
Time Wall: Perceptual Speed                    .07 (.10)      .07 (.08)       .00
AT Scenarios: Efficiency                       .13** (.18)    .09* (.11)      -.07
AT Scenarios: Safety                           .11** (.15)    .09* (.11)      -.07
AT Scenarios: Procedural Accuracy              .02 (.03)      -.02 (-.02)     .09*
Analogies: Reasoning                           .12** (.17)    .33** (.39)     -.08
Analogies: Latency                             .04 (.06)      .01 (.01)       .01
Analogies: Information Processing              .03 (.04)      -.04 (-.05)     -.02
Letter Factories: N letters correctly placed   .13** (.18)    .10* (.12)      -.07
Letter Factories: Planning, Thinking ahead     .15** (.21)    .17** (.20)     -.16**
Letter Factories: Situational Awareness        .17** (.24)    .25** (.30)     -.19**
Planes: Projection                             .04 (.06)      .01 (.01)       -.08
Planes: Dynamic Visual/Spatial                 .03 (.04)      .04 (.05)       -.09*
Planes: Timesharing                            .10* (.14)     .07 (.08)       -.06
Scan: Total Score                              .11** (.15)    .10* (.12)      -.04

*  Significantly  different  from  0  at  p  <  .05. 
**  Significantly  different  from  0  at  p  <  .01. 
Table 6.6. Correlations of AT-SAT Applied Math, Angles, and Dials tests with Archival Dial Reading,
Directional  Headings,  Math  Aptitude  Tests,  and  High  School  Math  Grades  Biographical  Item. 


Applied  Math:  N 
Items  Correct 

Angles:  N  Items 
Correct 

Dials:  N  Items  Correct 

AT-SAT  Predictor  Tests 

Applied  Math:  N  items 

.51** 

.39**

correct 

1043 

1043 

Angles:  N  items  correct 

.31** 

1043 

Dials:  N  items  correct 

Archival  tests 

Dial  reading:  N  items 

.52** 

.37**

.22**

correct 

145 

145 

145 

Dial  reading:  N  items  wrong 

-.36** 

-.28** 

-.39**

139 

139 

139 

Directional  Headings:  N 

.12 

-.05 

correct  Part  1 

171 

171 

171 

Directional  Headings:  N 

-.01 

-.07 

-.13 

correct  Part  2 

171 

171 

171 

Directional  Headings:  N 

.13 

.07 

.04 

wrong  Part  1 

99 

99 

99 

Directional  Headings:  N 

.14 

.18* 

.16 

wrong  Part  2 

142 

142 

142 

Math  Aptitude:  Total  score 

.63** 

.41** 

.29** 

240 

240 

240 

Biographical  item:  Math 

-.34** 

-.21**

-.13** 

grades  in  HS 

483 

482 

482 

Table 6.7. Correlation of the Version of Air Traffic Scenarios Test Used in
Pre-Training Screen Validation with the Version of Air Traffic Scenarios Test Used
in AT-SAT Validation (N=61)

                                        AT-SAT Air Traffic Scenarios Test Score
PTS Air Traffic Scenarios Test Score    Safety     Efficiency   Procedural Accuracy
Average Safety Score                    -.42**     -.33**       -.06
Average Total Delay Time                -.06       -.45**       .05

* Statistically significant at p < .05.
** Statistically significant at p < .01.
Table 6.8. Oblique Principal Components Analysis of EQ Scales


Factor 

EQ  Scales 

IP^HI 

A 

B 

Composure 

.657 

.701 

.179 

Concentration 

.706 

.728 

.183 

Self  Confidence 

.742 

.905 

-.086 

Sustained  Attention 

.604 

.540 

.340 

Decisiveness 

.766 

.855 

.036 

Execution 

.671 

.820 

-.019 

Flexibility 

.647 

.639 

.254 

Taking  Charge 

.667 

.902 

-.191 

Task  Closure/ 
Thoroughness 

.679 

.736 

.148 

Tolerance  for  High 
Intensity 

.594 

.820 

-.102 

Interpersonal 

Tolerance 

.765 

-.085 

.917 

Consistency  of  Work 
Behaviors 

.638 

.110 

.735 

Working 

Cooperatively 

.652 

.056 

.776 

Self  Awareness 

.382 

.337 

.368 

Variance  Explained 

56.02% 

9.49% 
Table 6.9. Description of 16PF Scales.

             Low score                                                 High score
Factor A:    Reserved, detached, critical, aloof, stiff           vs.  Warmhearted, outgoing, easygoing, participating
Factor B:    Poorer judgment, low mental capacity                 vs.  Better judgment, high mental capacity
Factor C:    Emotionally less stable, easily upset                vs.  Emotionally stable, calm
Factor E:    Obedient, mild, submissive                           vs.  Assertive, aggressive, dominance
Factor F:    Serious, silent, slow                                vs.  Happy-go-lucky, talkative, quick
Factor G:    Undependable, frivolous                              vs.  Conscientious, responsible
Factor H:    Shy, careful, restrained                             vs.  Adventurous, carefree, impulsive
Factor I:    Tough-minded, acts on practical                      vs.  Tender-minded, acts on sensitive intuition
Factor L:    Trusting, conciliatory, accepting conditions         vs.  Suspecting, irritable, jealous
Factor M:    Practical, conventional                              vs.  Imaginative, unconventional
Factor N:    Naivete, genuine                                     vs.  Shrewdness, polished
Factor O:    Self-assured, secure, cheerful                       vs.  Apprehensive, insecure, depressed
Factor Q1:   Conservative, respecting traditional ideas           vs.  Experimenting, liberal
Factor Q2:   Socially group dependent                             vs.  Self-sufficient
Factor Q3:   Careless of social rules, uncontrolled               vs.  Socially precise, controlled
Factor Q4:   Relaxed, composed                                    vs.  Tense, fretful
Table 6.10. Correlation of EQ and 16PF Scales

[EQ scale column headings illegible]

Factor Q1   .15     .06     -.12*   -.05    .10*    .10*    .09     .09     .05     .12*    .05     .11*    .01     .05
Factor Q2   .06     -.10*   -.21**  -.30**  -.11*   -.05    -.06    -.07    -.02    -.05    -.08    -.19**  -.10*   -.11*
Factor Q3   .23**   .18**   .23**   .15**   .23**   .23**   .14**   .18**   .20**   .24**   .18**   .15**   .12*    .24**
Factor Q4   -.34**  -.23**  -.22**  -.28**  -.29**  -.25**  -.20**  -.23**  -.25**  -.23**  -.27**  -.20**  -.25**  -.23**


FIGURES  AND  TABLES 


Table 6.11. Results of Multiple Linear Regression of OPM Rating, Final Score in Nonradar Screen
Program, and AT-SAT Predictor Tests on AT-SAT Composite Criterion Measure (N=586)

Variable                                  R      R²     ΔR²    Beta   t       Sig. level
Analogies: Reasoning                      .314   .099   .099   .22    5.16    .001
Final score in Nonradar Screen program    .417   .174   .075   .25    6.73    .001
Applied Math: N correct                   .431   .186   .012   .11    2.68    .008
Scan: Total Score                         .444   .197   .011   .10            .007
EQ: Unlikely virtues                      .455   .207   .010   -.10   -2.73   .007
AT Scenarios: Procedural Accuracy         .465   .216   .009   .10    2.56    .011
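
The second and third numeric columns of Table 6.11 can be checked against the first with a short calculation. The Python sketch below squares each multiple R and takes successive differences. Reading the third column as the increment in R² at each step is an inference from this arithmetic, and the function name is hypothetical.

```python
# Multiple R after each step, as printed in Table 6.11.
multiple_r = [0.314, 0.417, 0.431, 0.444, 0.455, 0.465]

def r_squared_steps(rs):
    """Return (R squared, increment in R squared) for each regression step."""
    steps = []
    previous = 0.0
    for r in rs:
        r2 = r * r
        steps.append((round(r2, 3), round(r2 - previous, 3)))
        previous = r2
    return steps

steps = r_squared_steps(multiple_r)
for r, (r2, delta) in zip(multiple_r, steps):
    print(f"R={r:.3f}  R2={r2:.3f}  increment={delta:.3f}")
```

The computed pairs match the tabled values, for example .174 and .075 at the second step, which supports treating the third column as the step-by-step gain in variance explained.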
APPENDIX C


Criterion  Assessment  Scales 
AT-SAT Rating Instructions


This  booklet  contains  ten  categories  you  will  use  to  make  assessment  ratings  as  part  of  the 
AT-SAT  project.  Each  category  contains: 

1.  A  Category  Definition  provided  immediately  below  the  category  title. 

2.  Rating  Standards  provided  above  the  seven-point  rating  scale.  These  broad  summary  statements  describe 
air  traffic  controller  proficiency  at  different  effectiveness  levels  to  help  make  your  ratings  more  objective. 

Making  Your  Ratings 

For  each  category,  read  the  category  definition  and  rating  standards.  Then,  compare  the  controller’s  current 
effectiveness  with  the  rating  standards  for  that  category. 

If you feel that the middle statements describe the controller's most typical effectiveness, choose a "4." If the
statements describing high effectiveness on the right of the scale closely match the controller's most typical
behavior, choose a rating of "6" or "7." Likewise, if the statements on the left of the scale match the controller's
most typical effectiveness, choose a rating of "1" or "2."

If the controller behaves as described in the low statements some of the time but performs like the middle statements
more of the time, a rating of "3" would be best. Similarly, if both the middle and high level statements describe a
controller at various times but the high statements are more descriptive, the fairest rating to give the controller is
probably a "6."

Please  use  these  statements  to  help  make  your  ratings  more  objective. 

Once  you  have  selected  a  rating,  make  your  rating  by  blackening  the  appropriate  circle  on  the  Criterion  Assessment 
Rating  Sheet.  Please  make  no  marks  in  this  booklet. 

Important  Points  to  Remember 

1.  Try  not  to  give  a  controller  the  same  rating  for  all  ten  categories.  Most  people  will  perform  well  in  some 
categories  and  less  effectively  in  others.  Your  ratings  should  show  the  controller's  strengths  and  weaknesses,  as 
appropriate. 

2.  If  you  are  rating  multiple  controllers,  try  not  to  give  all  of  them  the  same  rating  within  each  individual  category. 
Instead,  your  ratings  should  indicate  who  is  performing  more  effectively  and  who  is  performing  less  effectively 
in  each  category. 

3.  Avoid  being  influenced  by  such  things  as  appearance,  family  background,  and  other  personal  characteristics  that 
are  not  directly  related  to  performance. 

4.  Please  rate  independently  (do  not  confer  with  others). 

5.  The  most  important  point  is  to  make  your  ratings  as  accurate  as  possible.  This  is  the  best  way  to  help  us  validate 
the  new  selection  procedures. 

A. Maintaining Safe & Efficient Air Traffic Flow


How  effective  is  each  controller  at  maintaining  safe  and  efficient  air  traffic  flow? 


Sometimes  fails  to  maintain  minimum 
separation  or  to  recognize  and  resolve 
potential  conflictions. 

Uses  control  actions  that  fail  to  resolve 
potential  conflictions  or  that  result  in 
excessive  workload  (e.g.,  waits  until 
potential  conflictions  are  critical  before 
taking  action,  fails  to  take  wind  into 
account,  etc.) 

Does  not  always  sequence  aircraft 
adequately  or  ensure  proper  spacing 
between  aircraft;  may  cause  excessive 
and  unnecessary  delays  by  choosing 
poor  control  actions,  waiting  too  long  to 
provide  needed  commands, 
unnecessarily  vectoring  or  rerouting 
aircraft,  etc. 


Typically  uses  appropriate  control 
actions  to  maintain  proper  separation  or 
to  resolve  potential  conflictions. 

Resolves  simple  conflictions  and  traffic 
flow  problems  without  causing 
unnecessary  delays. 

Generally  uses  correct  procedures  to 
sequence  and  space  aircraft  safely; 
maintains  smooth  traffic  flow,  but  may 
not  use  the  most  efficient  control 
actions  (e.g.,  may  not  always  take 
aircraft  types  into  account). 


Consistently  maintains  safe,  efficient, 
and  orderly  traffic  flow,  even  under 
difficult  or  unusual  circumstances  (e.g., 
extremely  heavy  traffic,  bad  weather, 
etc.) 

Consistently  recognizes  potential 
problems  or  conflictions  well  in 
advance  and  takes  highly  effective 
action  to  maintain  separation  and 
efficient  air  traffic  flow. 

Sequences  and  spaces  traffic  effectively 
and  efficiently,  even  when  extremely 
busy  (e.g.,  by  taking  aircraft  types  into 
account);  always  maintains  proper 
separation  while  minimizing  delays 
(e.g.,  avoids  delaying  vectors  as 
appropriate). 

B. Maintaining Attention & Vigilance


How  effective  is  each  controller  at  maintaining  attention  and  vigilance? 


Has  a  tendency  to  focus  too  narrowly  on 
one  air  traffic  problem  and  sometimes 
fails  to  scan  the  radar  scope  for  other 
potential  problems  with  conflictions, 
traffic  flow,  weather,  etc. 

Often  does  not  recognize  that  an  action 
is  required;  is  often  lax  in  watching  the 
radar  scope  and  tends  to  significantly 
reduce  vigilance  during  slow  periods. 


For  the  most  part,  properly  scans  the 
scope  and  monitors  aircraft  to  maintain 
awareness  of  air  traffic  events,  potential 
problems,  etc. 

Is  attentive  to  the  radar  scope  and 
maintains  vigilance,  especially  during 
rush  periods;  may  occasionally  be  less 
attentive  when  traffic  is  light. 


Consistently  recognizes  potentially 
dangerous  conditions  such  as  errors 
made  by  pilots  (e.g.,  wrong  turns, 
descending  or  climbing  through 
assigned  altitudes,  etc.). 

Always  monitors  the  radar  scope  to 
ensure  that  clearances  and  other 
instructions  to  pilots  are  followed; 
remains  highly  vigilant,  even  during 
slow  periods. 

C. Prioritizing


How  effective  is  each  controller  at  prioritizing? 


Has  difficulty  recognizing  which  air 
traffic  problems  are  the  most  pressing; 
may  deal  with  problems  in 
chronological  order,  or  take  the  easy 
ones  first. 

Often  fails  to  prioritize  activities,  acting 
on  air  traffic  problems  without 
evaluating  the  possible  consequences  of 
own  actions. 

Puts  off  decisions  and  actions  that 
should  be  taken  right  away. 

Generally recognizes the most
important  air  traffic  problems  and 
handles  them  before  the  less  pressing 
ones. 

When  prioritizing  own  actions, 
normally  looks  ahead  to  assess  potential 
air  traffic  problems  that  might  result 
from  own  actions. 

Usually  takes  early  or  prompt  action  to 
deal  with  air  traffic  problems. 

Always recognizes which air traffic
problems  need  immediate  attention  and 
handles  them  before  less  pressing  ones; 
consistently  uses  appropriate  priorities 
for  control  actions. 

Prioritizes  activities  with  extreme 
effectiveness,  consistently  looking 
ahead  and  accurately  predicting 
problems  that  will  result  from  revised 
clearances,  rapidly  degrading  weather, 
etc. 

Invariably  takes  early  or  prompt  action 
to  resolve  air  traffic  problems. 

D. Communicating & Informing


How  effective  is  each  controller  at  communicating  and  informing? 


Is  consistently  too  wordy,  imprecise  in 
phraseology,  or  uses  slang 
inappropriately  during  transmissions  to 
pilots,  other  controllers,  TMU,  etc.;  may 
be  difficult  to  understand. 

Is  frequently  careless  about  informing 
pilots  concerning  circumstances  that 
affect  them  such  as  weather,  nearby 
traffic,  etc. 

Often  fails  to  ensure  that  own 
instructions  are  understood;  is  not  very 
good  at  picking  up  on  errors  in  pilot 
readbacks  of  clearances,  course 
changes,  etc. 


Radio  and  interphone  communications 
are  almost  always  easy  to  understand; 
occasionally  may  be  somewhat  wordy 
or  use  ambiguous  phraseology  on  the 
air. 

Is  normally  good  at  informing  pilots 
about  situations  and  conditions  that 
affect them (e.g., safety-related weather,
nearby  traffic,  etc.);  gives  adequate 
relief  briefings  to  relieving  controllers. 

For  the  most  part,  checks  to  be  certain 
that  own  instructions  are  understood; 
only  occasionally  fails  to  pick  up  on 
inaccurate  readbacks  from  pilots. 


Always  uses  clear  and  concise 
phraseology  when  talking  to  pilots  or 
other  controllers;  is  very  easy  to 
understand. 

Consistently  provides  pilots  with  the 
information  they  need,  such  as  timely 
safety  alerts,  weather  advisories, 
warnings  about  unpublished 
obstructions,  etc.;  gives  complete  and 
thorough  relief  briefings  to  relieving 
controllers. 

Communicates  in  a  highly  effective 
manner,  always  ensuring  that  own 
instructions  are  clearly  understood; 
conscientiously  attends  to  pilot 
readbacks  of  clearances,  assigned 
altitudes,  course  changes,  etc. 

E. Coordinating


How  effective  is  each  controller  at  coordinating? 


Is  often  ineffective  in  receiving  or 
initiating  hand-offs  (e.g.,  may  often  fail 
to  contact  controller  in  adjacent  sector 
even  when  a  hand-off  is  clearly 
required). 

When  coordination  is  required,  often 
fails  to  contact  appropriate  persons 
(e.g.,  pilot,  other  controllers,  tower, 
etc.)  or  does  so  too  slowly,  sometimes 
causing  traffic  problems,  delays,  or 
worse. 



Is  generally  good  at  hand-offs  and 
pointouts,  but  may  be  somewhat  slow  in 
using  hand-off  line  when  very  busy. 

When  the  situation  calls  for 
coordination,  usually  contacts  all 
appropriate  persons  and  coordinates 
properly  with  others. 

Always coordinates hand-offs and
pointouts  appropriately,  both  initiating 
and  receiving  them  very  effectively  and 
efficiently,  even  when  very  busy. 

Even  in  a  tight  time  frame  or  difficult 
circumstances,  always  contacts  and 
works  with  other  controllers  and  pilots, 
as  appropriate;  effectively  and 
efficiently  coordinates  to  correct  and 
avoid  traffic  problems  or  to  reduce 
confusion  and  workload. 

F. Managing Multiple Tasks


How  effective  is  each  controller  at  managing  multiple  tasks? 


Has  difficulty  keeping  track  of  several 
aircraft  at  the  same  time;  may  focus  too 
narrowly  on  some  aircraft  while 
ignoring  others. 

Is  ineffective  at  performing  multiple 
tasks  simultaneously,  even  when  the 
tasks  are  fairly  routine  (e.g.,  talking  to 
pilots  and  writing  on  strips);  prefers  to 
“deal  with  one  thing  at  a  time.” 

Interruptions  and  distractions  often 
cause  him/her  to  forget  about  some  of 
the  immediate  air  traffic  problems;  may 
be  slow  in  recalling  what  he/she 
intended  to  do  with  the  traffic  before  the 
interruption.

Is usually able to keep on top of
movement  of  several  aircraft 
simultaneously,  while  also  dealing  with 
pilot  communications,  the  flight  strips, 
etc.;  when  very  busy,  may  have  to 
simplify  the  situation  (e.g.,  vector 
aircraft,  put  off  some  communications, 
etc.) 

Is  able  to  perform  two  or  more  routine 
tasks  at  the  same  time  (e.g.,  monitoring 
the  screen,  talking  with  pilots,  and 
handling  strips.) 

After  an  interruption,  does  not  usually 
have  much  trouble  handling  the  air 
traffic  problems  remaining  from  prior  to 
the interruption.

Is extremely adept at keeping track of
many  aircraft  while  at  the  same  time 
handling  pilot  communications,  strip 
work,  etc. 

Effortlessly  performs  two  or  more 
complex  tasks  simultaneously  (e.g., 
sequencing  arrival  traffic,  dealing  with 
holding  aircraft  and  approaches, 
conducting  non-radar  procedures,  etc.) 

After  an  interruption,  immediately 
remembers  where  aircraft  are  or  should 
be,  what  he/she  was  doing  with  traffic 
before  the  interruption,  how  the 
intended  control  strategy  for  aircraft 
was to be carried out, etc.

G. Reacting to Stress


How  effective  is  each  controller  at  reacting  to  stress? 


Becomes  shaken  and  ineffective  in 
emergency  situations. 

Reacts  poorly  and  performance  suffers 
under  stressful  air  traffic  conditions. 

Does  not  maintain  his/her  composure 
when  serious  problems  arise. 


Remains  calm  and  cool  in  most 
emergency  situations. 

Stays  calm,  focused,  and  functional 
under  busy  and/or  somewhat  stressful 
conditions. 

Shows  professional  cool  in  handling 
routine  problems. 


Remains  very  calm  and  cool  and  reacts 
effectively  even  in  very  serious 
emergency  situations  such  as  in-flight 
emergencies,  lost  pilots,  VFR  pilots  in 
IFR  conditions,  etc. 

Stays  calm,  focused,  and  very 
functional  in  busy,  and  very  stressful 
conditions  (e.g.,  sudden  weather 
problems  that  severely  reduce  usable 
airspace). 

Handles  even  serious  problems  with 
professional cool.

H. Adaptability & Flexibility


How  effective  is  each  controller  in  the  area  of  adaptability  and  flexibility? 


Does  not  adjust  well  to  unusual  and 
difficult  air  traffic  situations. 

Rarely displays good “fall-back”
strategies  for  dealing  with  unanticipated 
air  traffic  problems. 

Is  ineffective  at  handling  air  traffic 
situations  with  no  clearly  prescribed 
procedures.

Is usually able to adapt effectively to
most  situations  such  as  worsening 
weather,  equipment  problems,  etc. 

Frequently,  but  not  always,  has 
effective  contingency  strategies  for 
unforeseen  or  unanticipated  air  traffic 
problems  when  they  arise. 

For  the  most  part,  is  good  at  handling 
air  traffic  situations  that  have  no 
“textbook  answers,”  but  does  better 
with the more routine problems.

Reacts expediently and effectively to
even  the  most  complicating  events  (e.g., 
quickly  devises  and  executes  a  complex 
re-route  plan  for  several  aircraft  when 
thunderstorms  begin  forming). 

Is  very  adept  at  using  effective 
contingency  or  “fall-back”  strategies 
when  unforeseen  or  unanticipated  air 
traffic  problems  arise. 

Deals  effectively  with  even  very 
difficult  air  traffic  situations  where 
there  are  no  clearly  prescribed 
procedures.

I. Technical Knowledge


How  effective  is  each  controller  in  the  area  of  technical  knowledge? 


Is  not  very  good  at  remaining  current 
on  new  letters  of  agreement,  revised 
air  traffic  procedures,  etc. 

Sometimes  makes  errors  related  to  not 
knowing  aircraft  limitations. 

Is  unfamiliar  with  some  of  his/her 
equipment  and  how  it  works. 


Is  usually  knowledgeable  about  and 
up-to-date  on  most  information 
relevant  to  controlling  traffic  (e.g., 
letters  of  agreement,  air  traffic 
procedures,  etc.) 

Has  adequate  knowledge  of  different 
aircrafts’  capabilities  and  applies  that 
knowledge  to  avoid  most  errors 
associated  with  not  knowing  aircraft 
limitations. 

Is  reasonably  familiar  with  his/her 
equipment  and  how  it  works. 


Always keeps up-to-date on letters of
agreement,  all  pertinent  procedures 
and  policies,  any  sector-specific 
changes  (e.g.,  revised  restricted  area 
boundaries),  etc. 

Is  an  expert  regarding  different 
aircrafts’  capabilities  and,  as  a  result, 
never  makes  errors  such  as  climbing 
an  aircraft  beyond  its  limits,  making 
an  inappropriate  speed  assignment, 
requiring an impossibly tight turn, etc.

Is  extremely  knowledgeable  about 
and  familiar  with  his/her  equipment 
and how it functions.

J. Teamwork

How  effective  is  each  controller  in  the  area  of  teamwork? 


Ignores  traffic  flow  in  adjacent  sectors 
and  the  impact  own  traffic  flow  may 
have  on  co-workers;  avoids  pitching 
in  to  help  fellow  controllers,  even  in 
high  load  situations  such  as  loss  of 
radar  or  poor  weather  conditions. 

Often  waits  until  the  last  minute  to 
take  hand-offs;  frequently  dumps  air 
traffic  in  adjacent  sectors  so  as  to 
reduce  own  workload;  rarely 
volunteers  to  take  on  additional 
responsibility  to  help  co-workers. 

Becomes  extremely  defensive,  even 
belligerent,  if  constructive  feedback  is 
offered  by  supervisors  or  co-workers; 
may  belittle  co-workers,  sometimes  in 
front  of  others;  rarely  works  well  with 
others.

Is usually willing to assist co-workers
who  become  extremely  busy  (e.g.,  by 
assuming  hand-off  and  coordination 
duties). 

Is generally considerate of
co-workers; adjusts own traffic flow to
ease  workload  of  adjacent  sector 
when  there  are  obvious  problems. 

For  the  most  part  accepts  constructive 
criticism from supervisors or
co-workers; is usually able to refrain
from  criticizing  other  ATCSs; 
generally  works  well  with  other 
controllers.

Is always alert to traffic in other
sectors and pitches in to help
co-workers (e.g., by accepting additional
airspace  or  assuming  hand-off  and 
coordination  duties). 

Is  always  considerate  of  co-workers, 
working  to  ensure  smooth  and  timely 
traffic  flow  between  adjacent  sectors; 
whenever  possible,  adjusts  own  traffic 
flow  to  ease  workload  of  next  sector 
(e.g.,  when  traffic  in  adjacent  sectors 
becomes  heavy). 

Is  always  open  to  feedback  from 
supervisors  or  co-workers,  accepting 
criticism  in  a  positive,  constructive, 
and  professional  manner;  never 
belittles  co-workers;  always  works 
harmoniously with other controllers.

Overall Effectiveness


The  scales  you  have  just  made  ratings  on  represent  10  different  areas  important  for  air  traffic  controller  effectiveness. 
This  scale  asks  you  to  rate  the  overall  effectiveness  of  each  controller,  taking  into  account  behavior  related  to  all  10  of 
the  previous  categories. 


Performs  poorly  in  important 
effectiveness  areas;  does  not  meet 
standards  and  expectations  for 
adequate  controller  performance. 


Adequately  performs  in  important 
effectiveness  areas;  meets  standards 
and  expectations  for  adequate 
controller  performance. 


Performs  excellently  in  all  or  almost 
all  effectiveness  areas;  exceeds 
standards  and  expectations  for 
controller performance.

APPENDIX D


Rater  Training  Script 


Conducting  the  Rating  Sessions 


It  is  likely  that  you  will  have  to  conduct  many  rating  sessions  in  order  to  accommodate  different  raters’  schedules. 
The  facility  management  will  schedule  these  sessions,  with  help  and  input  from  the  data  collection  team.  Please 
conduct  these  sessions  as  efficiently  as  possible,  so  the  raters  can  get  back  to  their  jobs  as  quickly  as  possible. 

At  the  beginning  of  each  rating  session,  you  will  need  to  have  the  following  rating  materials  for  each  rater.  You  will 
also  need  the  Criterion  Assessment  Master  Roster  and  the  Master  List. 

1.  One  Biographical  Information:  Assessor  sheet  (for  those  assessors  who  are  not  also  participants) 

2. The Criterion Assessment Rating Sheet prepared for each rater (Up to five controllers can be rated per answer
sheet; therefore, some raters may have more than one sheet if they are rating more than five controllers)

3.  One  copy  of  the  Criterion  Assessment  Scales 

4.  Two  sharp  pencils  with  erasers 

5.  One  “Confidential”  envelope 

6.  One  Project  Overview  (for  those  Assessors  who  are  not  also  participants) 

When  raters  arrive  for  their  rating  sessions,  check  the  raters  who  arrive  against  a  roster  of  those  you  expect  in  a 
given  session.  Give  each  rater  the  appropriate  code  number  card,  and  ask  them  to  hang  onto  it. 

When  conducting  rating  sessions,  follow  the  steps  below  exactly  as  they  are  presented.  Instructions  that  you  should 
give  to  the  raters  appear  in  italics.  Special  instructions  for  administrators  appear  in  regular  type. 


Introductory Briefing

If you have not already done so, begin by introducing yourself. Then say:

(TA NOTE: The first two paragraphs below may be skipped if the assessors are also participants.)

[We  are  asking  you  to  participate  in  a  study  the  FAA  is  conducting  to  develop  a  new  entry-level  selection  system  for 
Air  Traffic  Controllers.  The  goal  of  this  project  is  to  develop  a  testing  system  that  will  identify  the  best  qualified 
applicants  for  the  controller  job. 

As part of this study, we need to collect assessment ratings to determine how well the new selection tests are
working. To do this, we are asking peers and supervisors of the controllers who are participating in our study to
rate these controllers’ job effectiveness. If individuals who score higher on the experimental tests are also
performing better in their jobs, these tests will be useful for identifying individuals who are likely to be successful as
new Air Traffic Controllers.]

The  ratings  you  provide  will  be  used  for  research  purposes  only  and  are  confidential.  No  one  in  the  FAA  will  see  the 
ratings.  These  ratings  will  only  be  used  to  evaluate  the  experimental  selection  tests.  In  fact,  we  have  gone  to  great 
lengths  to  ensure  confidentiality.  At  the  end  of  the  session,  we'll  tear  off  the  bottom  of  the  rating  sheet  with  the 
names  on  it.  The  database  will  contain  only  code  numbers. 

It is very important that you complete the ratings accurately. In fact, if we don't get accurate ratings the validation
process  basically  falls  apart.  Again,  the  results  of  this  study  will  make  a  big  difference  in  the  selection  of  future 
controllers.  So,  it  is  very  important  that  you  complete  your  ratings  as  accurately  as  possible. 

Emphasize  that  the  information  they  are  providing  will  help  improve  the  quality  of  new  controllers  they  will  likely 
have to train and/or work with. Make sure they know that accurate assessments are necessary for this to happen.

Hand out one Project Overview to each assessor who is not a participant and tell them that this sheet contains more
detail  concerning  the  project  if  they  are  interested. 

The  raters  will  sometimes  have  a  variety  of  questions  or  comments  concerning  topics  such  as  the  purpose  of  the 
study,  what  is  wrong  with  the  old  selection  test,  etc.  In  order  to  enhance  cooperation,  it  is  best  to  discuss  any 
questions  and  concerns,  even  though  this  will  take  up  some  additional  time.  However,  if  the  questions  and 
discussions  become  too  extensive,  inform  the  group  that  there  are  several  forms  that  need  to  be  completed.  Also,  tell 
them  you  would  be  willing  to  stay  after  the  session  to  answer  questions. 


Completing  the  Criterion  Assessment  Rating  Scales 

Before  beginning  the  ratings,  have  those  assessors  who  are  not  also  participants  complete  a  Biographical 
Information:  Assessor  sheet.  Point  out  that  the  form  is  double-sided. 

Check  to  make  sure  that  these  are  completed  correctly  and  that  no  items  are  left  blank.  Make  sure  they  have  entered 
the  correct  code  number  in  the  upper  right  corner  and  that  they  have  completed  the  back  of  the  form.  Collect  the 
completed  forms,  and  keep  them  secure. 

Then  hand  out  the  Criterion  Assessment  Rating  Sheets  (with  the  Rater  code  number,  Ratee  code  number(s),  and 
Ratee  name(s)  already  recorded)  and  read  the  script  on  the  following  pages: 

Here are the Criterion Assessment Rating Sheets. At the bottom of these rating sheets are the names of the
controllers you will be rating; the code numbers that have been assigned to each of these controllers appear near
the top. You will mark your ratings for each controller in the column above his or her name. Your code number
should also appear at the top of the rating sheet.

Look  at  the  names  listed  on  the  bottom  of  your  rating  sheet.  These  should  be  controllers  you  have  worked  closely 
with,  that  is,  you  are  very  familiar  with  how  they  do  their  jobs.  By  this  we  mean  controllers  who  have  worked  in 
your  area  for  at  least  6  months,  and  who  you  have  observed  working  traffic  at  least  10  times  a  month,  on  average, 
during  those  6  months.  If  you  do  not  meet  these  criteria  for  one  or  more  of  the  controllers  listed  at  the  bottom  of 
your  rating  sheet,  please  let  me  know  now. 

If  a  rater  does  not  strictly  meet  these  guidelines  for  a  ratee,  but  clearly  knows  the  ratee’s  performance  (e.g.,  worked 
with  him/her  for  several  years  up  to  three  months  ago),  go  ahead  and  have  that  rater  rate  that  ratee.  If  raters  are 
clearly  not  qualified,  make  any  corrections  necessary  on  both  the  rating  sheets  and  on  your  Criterion  Assessment 
Master  Roster. 
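
The familiarity guideline above (at least 6 months in the rater's area and at least 10 observations per month on average) can be expressed as a simple check. The following is only an illustrative sketch; the function and parameter names are hypothetical and are not part of the AT-SAT materials, and the administrator's judgment call for raters who "clearly know" the ratee is modeled as a plain override flag.

```python
# Hypothetical sketch of the rater familiarity guideline; names are illustrative only.
MIN_MONTHS_IN_AREA = 6             # ratee has worked in the rater's area at least 6 months
MIN_OBSERVATIONS_PER_MONTH = 10    # rater observed ratee working traffic 10+ times a month


def meets_rater_guidelines(months_in_area: float, observations_per_month: float) -> bool:
    """Return True if the strict familiarity guidelines are met."""
    return (months_in_area >= MIN_MONTHS_IN_AREA
            and observations_per_month >= MIN_OBSERVATIONS_PER_MONTH)


def may_rate(months_in_area: float, observations_per_month: float,
             administrator_judges_familiar: bool = False) -> bool:
    """Allow the rating if the guidelines are met, or if the administrator
    judges that the rater clearly knows the ratee's performance."""
    return (meets_rater_guidelines(months_in_area, observations_per_month)
            or administrator_judges_familiar)
```

For example, `may_rate(3, 12)` is false under the strict guidelines, while `may_rate(3, 12, administrator_judges_familiar=True)` reflects the exception described above.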

Now,  at  the  top  portion  of  the  rating  sheet,  you  will  see  spaces  to  indicate  the  length  of  time  you  have  worked  with 
each  controller  you  are  rating.  (Point  to  this  area  on  the  form).  Please  fill  these  out. 

Finally,  please  think  back  over  the  past  6  months  and  estimate  how  many  times,  per  week,  you  have  worked  with 
each  of  these  controllers.  This  should  be  the  number  of  times  you  actually  sat  down  and  worked  traffic  together, 
which  could  be  more  than  once  a  day.  Record  this  number  in  the  space  provided. 

TA  NOTE:  If  the  experience  with  the  controller  was  not  in  the  past  6  months,  ask  them  to  indicate  how  many  times 
per  week  they  worked  with  the  controller  during  the  time  they  were  working  together. 

Allow  raters  time  to  enter  this  information.  Check  around  the  room  to  be  sure  each  person  is  following  your 
instructions.  Then  hand  out  the  Criterion  Assessment  Scales  and  say: 

Now, we're going to start the assessments that I told you about a few minutes ago. I'm distributing a booklet that
contains the rating scales. Please read the instructions on the first page, and then I'll have a few more points to
make before getting you started.

Give  them  a  few  minutes  to  read  the  instructions.  Make  sure  they  don’t  start  making  their  ratings  until  after  you 
complete the briefing.

OK, please open your rating booklet to the first category, entitled Maintaining Safe & Efficient Air Traffic Flow
(hold  this  page  up  to  the  group). 


The  most  important  parts  of  these  rating  scales  are  the  Rating  Standards  that  describe  exactly  what  we  mean  by 
exceptional  (point  to  the  statement  on  the  far  right),  fully  adequate  (point  to  the  middle  statement),  and  below 
average  (point  to  the  statement  on  the  far  left)  effectiveness  in  each  category.  These  behavioral  statements  or 
benchmarks  should  make  the  ratings  more  objective  because  we  are  asking  you  to  compare  the  performance  of  each 
controller  you  are  rating  with  the  behavioral  benchmarks  on  the  scales. 

Now I'd like to go through a few examples of how this comparing or matching process should proceed. Let's look
again at the category of Maintaining Safe & Efficient Air Traffic Flow. If you believe that the middle (point to
them) statements best describe the controller's most typical effectiveness in this area, then you should give that
person a rating of “4.” If the statements on the far right (point to them) best match the controller's most typical
behavior, choose a rating of “7.” Likewise, if the statements described on the far left (point to them) match the
controller's most typical behavior, choose a rating of “1.” However, we have found that often this matching doesn't
line up that simply.

For example, you may feel that the middle statements and the low statements describe the controller's effectiveness
at times, but that his or her typical effectiveness is more like the middle statements. If this is the case, an evaluation
of “3” would be best. As a final example, if the controller has most often performed like the high statements but at
times performs at the middle level as well, a “6” would be the best rating.

The main point here is that for each category, you are to compare your observations of each controller's
effectiveness  to  the  behavioral  statements  or  Rating  Standards  and  then  select  the  number  that  best  reflects  the 
controller's  effectiveness. 
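
The matching examples in the script can be summarized as a small lookup. This is a rough sketch, not part of the rating materials: the function name and level labels are hypothetical, and only the cases stated in the script (typical low, middle, or high; middle with some low; high with some middle) come from the source. The remaining combinations are filled in by symmetry as an assumption.

```python
# Hypothetical sketch of the benchmark-matching examples; names are illustrative only.
ANCHORS = {"low": 1, "middle": 4, "high": 7}  # ratings for a clean match to one benchmark


def suggested_rating(typical, occasional=None):
    """Suggest a 1-7 rating from the benchmark level that best describes typical
    effectiveness and, optionally, a second level seen only at times."""
    anchor = ANCHORS[typical]
    if occasional is None or occasional == typical:
        return anchor
    # Shift one point toward the occasionally observed level.
    # Source examples: middle with some low gives 3; high with some middle gives 6.
    # Other combinations are an assumption made by symmetry.
    step = 1 if ANCHORS[occasional] > anchor else -1
    return anchor + step
```

For instance, `suggested_rating("middle", "low")` gives 3 and `suggested_rating("high", "middle")` gives 6, matching the two mixed examples in the script.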

One thing I'd like to bring to your attention is that the performance described in the high statement is truly
outstanding. For a controller to be rated a “6” or “7,” he or she should perform as described in the high
statements most of the time. I am not suggesting that there are no truly outstanding controllers, simply that you
should reserve these ratings, especially the “7,” for the very high performing controllers.

Once you have selected a number, blacken the appropriate circle on the Criterion Assessment Rating Sheet.

Does  anyone  have  any  questions? 

Now let's go through the “Important Points to Remember” when making your evaluations.

(Hold  up  a  rating  booklet  and  show  them  what  you  are  referring  to.) 

The  first  point  to  remember  is,  try  not  to  give  a  controller  the  same  rating  for  all  ten  categories.  It  is  unlikely  that 
any  one  person  performs  at  exactly  the  same  level  in  all  ten  rating  categories.  Instead,  most  people  will  be  more 
proficient  in  some  categories  and  less  proficient  in  others.  Your  evaluations  should  reflect  each  controller's 
strengths  and  weaknesses. 

TA  Note:  You  can  skip  the  following  paragraph  if  no  one  in  the  session  is  rating  multiple  controllers. 

[The  second  point  is  if  you  are  evaluating  multiple  controllers,  try  not  to  give  all  of  them  the  same  rating  within  an 
individual category. Again, it is unlikely that all of the people you are evaluating perform at the exact same level of
proficiency  within  a  given  category.  Thus,  your  ratings  should  show  who  is  more  and  less  effective  within  each 
rating  category.] 

Another  thing  that  can  happen  is  that  raters  sometimes  let  things  that  have  nothing  to  do  with  performance  affect 
their  evaluations,  such  as  friendship  or  simply  liking  the  controller.  These  assessment  scales  target  only  job 
performance  and  that's  what  you  should  base  your  ratings  on. 

Now  that  I  have  gone  through  some  possible  rating  problems,  there's  one  last  point  I  want  to  stress.  That  is,  the 
most  important  guidance  is  to  be  as  accurate  as  possible  in  your  evaluations.  If  you  really  believe,  for  example,  that 
three controllers should be given the same rating in a category or that one person performs at, let's say, the “5”
level in several categories, then you should rate them in this way. However, where there are strengths and
weaknesses for a controller you are rating or differences between controllers, it is important that your ratings
reflect  these  strong  points,  weak  points,  and  differences  between  controllers. 

TA  Note:  You  can  skip  the  following  paragraph  if  no  one  in  the  session  is  rating  multiple  controllers. 

[If  you  are  rating  more  than  one  controller,  it  will  be  easiest  for  you  to  rate  each  one  on  Category  A,  then  go  to 
Category  B  and  rate  everyone  on  that  category  and  so  on.] 

Walk  around  and  check  the  ratings;  make  sure  they  are  filling  in  the  answer  forms  correctly  (dark  marks  and  circles 
filled).  Note  obvious  problems  such  as  all  ratees  being  evaluated  at  exactly  the  same  level,  and,  if  possible,  ask  if 
this  is  what  the  rater  intends.  That  is,  if  a  rater  is  rating  all  controllers  at  the  same  level  of  performance  on  many 
categories,  ask  him  or  her  if  these  individuals  really  do  perform  at  the  same  level. 
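
The check for "obvious problems" described above could be sketched as follows. This is only an illustration under assumed names and data layout (a mapping of ratee code to category ratings); it flags patterns for the administrator to ask about and does not change any ratings.

```python
# Hypothetical sketch of the uniform-rating check; names and data layout are assumptions.
def uniform_rating_flags(ratings):
    """ratings maps ratee code -> {category letter: rating or None for blank}.
    Returns ("ratee", code) when one ratee has the same rating in every category,
    and ("category", letter) when every ratee has the same rating in a category."""
    flags = []
    for ratee, by_category in ratings.items():
        values = [v for v in by_category.values() if v is not None]
        if len(values) > 1 and len(set(values)) == 1:
            flags.append(("ratee", ratee))
    if len(ratings) > 1:
        categories = sorted({c for by_category in ratings.values() for c in by_category})
        for category in categories:
            values = [by_category.get(category) for by_category in ratings.values()
                      if by_category.get(category) is not None]
            if len(values) > 1 and len(set(values)) == 1:
                flags.append(("category", category))
    return flags
```

A flag is only a prompt to ask the rater whether the identical ratings are intended; as the script stresses, accurate ratings take priority over spreading them out.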

Encourage  raters,  as  appropriate  -  “looking  good”,  “looks  like  you’ve  got  it,”  etc.;  answer  individual  questions  as 
they  arise.  Make  sure  they  are  rating  all  controllers  on  one  category  before  proceeding  to  the  next  category. 

Some  raters  may  indicate  that  they  have  not  observed  performance  sufficiently  in  one  or  more  of  the  categories  to 
make  a  rating.  Encourage  the  rater  to  make  a  rating  if  at  all  possible.  If  the  rater  still  feels  incapable  of  making  a 
rating,  tell  the  rater  to  leave  that  category  blank. 

The  raters  will  likely  finish  their  ratings  at  different  times. 

As  you  collect  the  Criterion  Assessment  Rating  Sheets,  check  to  make  sure  the  code  numbers  are  correct.  Also 
check that the peer/supervisor distinction and “length of time worked with” the ratee were completed on each rating
sheet  and  that  one  rating  was  filled  in  for  each  category.  Make  sure  any  errors  are  corrected  before  collecting  the 
rating  forms.  Once  you  are  comfortable  that  the  forms  have  been  filled  in  correctly,  ask  the  rater  to  remove  the  ratee 
names.  The  rating  sheets  are  perforated,  so  the  bottom  portion  that  lists  the  ratee  names  should  tear  off  easily. 

Then,  give  each  rater  a  Confidential  Envelope,  and  ask  them  to  insert  their  rating  sheets  into  this  Confidential 
Envelope,  and  seal  the  envelope.  Collect  the  rating  booklets  separately,  as  they  can  be  used  again.  Collect  all 
materials  that  have  been  handed  out  during  the  session,  including  the  ratee  names  (removed  from  the  rating  sheet) 
and  the  code  number  cards.  Check  off  on  your  Criterion  Assessment  Master  Roster  that  the  ratings  are  “DONE”  for 
the individuals rated.

Over the Shoulder (OTS)

AT-SAT High Fidelity Simulation Over The Shoulder (OTS) Rating Form

Administrative Information - Page 1

Scenario Number: HFG 1 2 3 4 5 6 7
Lab Number: 1 2
Position: 1 2 3 4 5 6 7 8 9 10
Participant ID Number:
Rater ID Number:

AT-SAT High Fidelity Simulation Over The Shoulder (OTS) Rating Scales

Rating Scale (seven response circles per Rating Dimension): Below Average (2 circles), Fully Adequate (3 circles), Exceptional (2 circles)

A. Maintaining Separation

• Checks separation and evaluates traffic movement to ensure separation standards are maintained
• Considers aircraft performance parameters when issuing clearances
• Detects and resolves impending conflictions
• Establishes and maintains proper aircraft identification
• Applies appropriate speed and altitude restrictions
• Properly uses separation procedures to ensure safety
• Analyzes pilot requests, plans and issues clearances
• Issues safety and traffic alerts

B. Maintaining Efficient Air Traffic Flow

• Accurately predicts sector traffic overload and takes appropriate action
• When necessary, issues a new clearance to expedite traffic flow
• Ensures clearances require minimum flight path changes
• Reacts to/resolves potential conflictions efficiently
• Controls traffic so as to ensure efficient and timely traffic flow

C. Maintaining Attention and Situation Awareness

• Maintains awareness of total traffic situation
• Reviews and ensures appropriate route of flight
• Recognizes and responds to pilot deviations from ATC clearances
• Scans properly for air traffic events, situations, potential problems, etc.
• Listens to readbacks and ensures they are accurate
• Remembers, keeps track of, locates, and if necessary orients aircraft
• Assigns requested altitude in timely manner
• Descends arrivals in timely manner
• Keeps data blocks separated
• Accepts/performs timely handoffs

D. Communicating Clearly, Accurately, and Efficiently

• Issues clearances that are complete, correct, and timely
• Communicates clearly and concisely
• Makes only necessary transmissions
• Uses correct call signs
• Uses standard/prescribed phraseology
• Uses appropriate speech rate
• Properly establishes, maintains, and terminates communications
• Listens carefully to pilots and controllers
• Avoids lengthy clearances
• Issues appropriate arrival and departure information

AT-SAT High Fidelity Simulation Over The Shoulder (OTS) Rating Form - Page 2


Rating Scale (seven response circles per Rating Dimension): Below Average (2 circles), Fully Adequate (3 circles), Exceptional (2 circles)

E. Coordinating

• Performs handoff and pointout procedures correctly
• Effectively coordinates clearances, changes in aircraft destinations, altitudes, etc.
• Provides complete/accurate position relief briefings
• Performs required coordinations effectively
• Initiates and receives handoffs and pointouts in an efficient and effective manner
• Processes flight plans/amendments as required

F. Performing Multiple Tasks

• Shifts attention between several aircraft when necessary
• Keeps track of a large number of aircraft/events at a time
• Prioritizes activities effectively
• Communicates in a timely fashion while performing other actions
• Returns to what he/she was doing after an interruption

G. Managing Sector Workload

• Handles heavy, emergency, and unusual traffic situations effectively
• Stays calm, focused, and functional in busy and stressful conditions
• Responds to imposed airspace restrictions
• Responds to traffic management constraints/initiatives
• Handles unexpected situations effectively (e.g., computer/communication failures)
• Deals effectively with situations for which there may not be clearly prescribed procedures
• Uses contingency or “fall-back” strategies effectively

H. Overall Performance

Behavioral and Event Checklist


HFG1

Event                                      Aircraft identity                      Totals

Operational Errors                         1.  2.  3.  4.  5.  6.  7.  8.  9.
(Write both call signs in one box)

Operational Deviations/SUA violations      1.  2.  3.  4.  5.  6.  7.  8.  9.
(Write call sign in each box)

Behavior                                   Number of events                       Totals

Failed to accept handoff
LOA/Directive Violations
Readback/Hearback errors
Failed to accommodate pilot request
Made late frequency change
Unnecessary delays
Incorrect information in computer

Participant ID Number:
Rater ID Number:
Lab Number:
Position Number:

AT-SAT High Fidelity Standardization Guide

The  following  rules  and  interpretations  of  rules  have  been  agreed  to  and  will  be  used  in  evaluations  by  all  AT-SAT 
Raters in addition to rules set forth in FAA Handbook 7110.65, Aero ARTCC and Tulsa ATCT Letter of
Agreement,  Aero  ARTCC  and  McAlester  ATCT  Letter  of  Agreement,  and  Aero  ARTCC,  Memphis  ARTCC, 
Kansas  City  ARTCC,  Fort  Worth  ARTCC  Letter  of  Agreement. 

General  Stuff 

All  aircraft  have  to  be  vectored  for  straight-in  ILS  approach  to  MLC. 

If  aircraft  goes  into  TUL  airspace  then  back  out,  just  rate  performance  for  the  first  time  the  aircraft  is  in  your 
airspace. 

If  you  make  a  mistake  when  filling  out  any  of  the  forms,  either  erase  the  mark  or  draw  a  squiggly  line  through  the 
incorrect  mark. 

If participant fails to say “Radar service terminated,” don’t mark any Remaining Actions, but consider when making
OTS  ratings. 

If  the  pilot  makes  a  mistake  that  results  in  an  OE  or  OD,  mark  on  behavioral  checklist,  put  an  asterisk  next  to 
indicator,  and  explain  circumstance.  If  pilot  causes  OE  or  OD,  the  1/2  rule  does  not  apply  (1  OE  =  OTS  rating  of  2 
in Category A, 2 OEs = OTS rating of 1).


Behavioral  Checklist 


Operational  Errors 

An  Operational  Error  is  considered  to  occur  if  a  non-radar  clearance  does  not  provide  for  positive  separation, 
regardless of whether the controller corrects the error prior to loss of radar separation.

If  the  participant  makes  one  Operational  Error,  the  rater  shall  assign  a  rating  no  higher  than  2  in  the  Maintaining 
Separation  (A)  category  on  the  OTS  rating  form.  If  the  participant  makes  two  Operational  Errors,  the  rater  shall 
assign  a  rating  no  higher  than  1  in  the  Maintaining  Separation  (A)  category  on  the  OTS  rating  form.  If  participant 
makes  no  OEs,  rater  may  assign  any  number  for  category  A.  Making  an  operational  error  will  not  necessarily  affect 
ratings  for  other  categories  except  that  if  a  participant  is  rated  low  on  A  (Maintaining  Separation)  on  the  OTS  form, 
they  will  also  probably  be  rated  low  on  C  (Maintaining  Attention  and  Situation  Awareness). 
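
The Operational Error cap on Category A can be sketched in code as a quick check on a rater's entry. This is only an illustration: the function names are hypothetical, the 7-point scale maximum is an assumption (the guide does not state the scale here), and treating more than two OEs the same as two is also an assumption.

```python
def max_separation_rating(operational_errors: int, scale_max: int = 7) -> int:
    """Highest Category A (Maintaining Separation) rating allowed for an OE count.

    scale_max is an assumed top of the OTS scale, not taken from the guide.
    """
    if operational_errors < 0:
        raise ValueError("operational_errors must be non-negative")
    if operational_errors == 0:
        return scale_max  # no OEs: the rater may assign any rating
    if operational_errors == 1:
        return 2          # one OE: rating no higher than 2
    return 1              # two OEs: no higher than 1 (assumed for more than two)


def cap_separation_rating(rater_rating: int, operational_errors: int,
                          scale_max: int = 7) -> int:
    """Apply the cap to the rating the rater intended to assign."""
    return min(rater_rating, max_separation_rating(operational_errors, scale_max))
```

For example, an intended rating of 5 with one OE would be capped at 2, while an intended rating of 1 stays at 1.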

If  an  aircraft  is  cleared  off  an  airport,  is  auto-acquired  off  the  departure  list,  but  the  participant  is  not  yet  talking  to 
the  aircraft,  it  is  NOT  an  OE  if  another  aircraft  is  cleared  for  approach  into  that  same  airport. 

If  an  aircraft  is  cleared  below  the  MIA,  it  is  an  OE. 

If  an  aircraft  is  cleared  for  approach  without  telling  the  pilot  to  maintain  a  specific  altitude,  it  is  an  OE. 

If  an  aircraft  without  Mode  C  doesn’t  report  level,  the  participant  doesn’t  determine  a  reported  altitude,  and  the 
aircraft  flies  over  another  aircraft,  it  shall  be  scored  as  an  OE.  Also,  if  the  participant  doesn’t  enter  a  reported 
altitude  in  the  computer,  it  shall  also  be  scored  as  Incorrect  Information  in  Computer. 

Operational  Deviations 

An  Operational  Deviation  is  considered  to  occur  if  there  is  a  violation  of  published  MEAs. 

An  Operational  Deviation  is  considered  to  occur  if  an  aircraft  comes  within  2.5  miles  of  the  airspace  of  another 
facility  without  being  handed  off.  If  the  scenario  freezes  before  the  aircraft  gets  within  2.5  miles  of  another  facility’s 
airspace  and  it  hasn’t  yet  been  handed  off,  count  as  Make  Handoff  under  Remaining  Actions. 
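
The handoff rule above separates two outcomes, which the following sketch makes explicit. The function and parameter names are hypothetical, and reading "within 2.5 miles" as strictly less than 2.5 is an assumption.

```python
HANDOFF_BUFFER_MILES = 2.5  # distance from another facility's airspace, per the guide


def score_handoff(closest_unhanded_distance_miles: float,
                  handed_off_by_freeze: bool) -> str:
    """Classify an aircraft at the point the scenario freezes.

    closest_unhanded_distance_miles: the closest the aircraft came to the other
    facility's airspace while it had not yet been handed off.
    handed_off_by_freeze: whether a handoff was made before the scenario froze.
    """
    if closest_unhanded_distance_miles < HANDOFF_BUFFER_MILES:
        return "Operational Deviation"
    if not handed_off_by_freeze:
        # Scenario froze before the aircraft reached the buffer.
        return "Remaining Actions: Make Handoff"
    return "No entry"
```

An aircraft that froze 4 miles from the boundary without a handoff would be counted under Remaining Actions rather than as an OD.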

An  Operational  Deviation  occurred  if  the  participant  failed  to  point  out  an  aircraft  to  the  appropriate  sector  or  if  the 
participant  issued  a  clearance  to  an  aircraft  while  it  is  within  2.5  miles  of  the  airspace  boundary.  Raters  should  check 
the  location  of  the  aircraft  when  a  clearance  is  issued  to  see  if  it  is  within  2.5  miles  of  the  boundary.  If  it  is,  an  OD 
should  be  counted. 

Special  Use  Airspace  Violation 

A  Special  Use  Airspace  violation  is  considered  to  occur  if  an  aircraft  does  not  remain  clear  of  P57  or  if  an  aircraft 
does  not  clear  Restricted  Area  R931A  by  either  3  NM  or  500  feet  of  altitude. 
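
As a rough sketch of the "either 3 NM or 500 feet" test, the check below treats an aircraft as clear of R931A when it meets at least one of the two margins. The names are hypothetical, and counting exactly 3 NM or exactly 500 feet as clear is an assumption.

```python
def clears_r931a(lateral_nm: float, vertical_ft: float) -> bool:
    """True if the aircraft has at least 3 NM lateral OR 500 ft vertical margin."""
    return lateral_nm >= 3.0 or vertical_ft >= 500.0


def is_sua_violation(entered_p57: bool, lateral_nm: float, vertical_ft: float) -> bool:
    """Special Use Airspace violation: entered P57, or failed to clear R931A."""
    return entered_p57 or not clears_r931a(lateral_nm, vertical_ft)
```

An aircraft passing 2 NM from R931A but 600 feet above it would not be marked, while one passing 2 NM away with only 400 feet of altitude margin would be.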

Accepted  Handoff/Pointout  Late 

Acceptance  of  a  Handoff/Pointout  will  be  considered  late  if  the  radar  target  is  within  2.5  NM  of  1)  Tulsa  Approach 
boundary  if  the  aircraft  is  exiting  Tulsa  Approach  airspace  or  2)  crossing  the  Aero  Center  boundary  if  the  aircraft  is 
transiting  En-Route  airspace. 

LOA/Directive  Violation 

A  violation  of  the  Tulsa  Letter  of  Agreement  is  considered  to  occur  if  a  jet  aircraft  is  not  established  at  250  knots 
prior  to  crossing  the  appropriate  arrival  fix,  if  an  aircraft  is  not  level  at  prescribed  arrival  altitudes  at  appropriate 
arrival  fix,  even  if  a  different  altitude,  etc.,  was  coordinated,  or  if  aircraft  are  not  appropriately  spaced. 

There  will  be  no  blanket  coordination  of  altitude  or  speed  restrictions  different  from  those  specified  in  the  LOA.  For 
specific  circumstances  when  pilots  aren’t  going  to  meet  crossing  restrictions,  if  that  is  coordinated,  it  won’t  be 
counted  as  an  LOA  violation. 

Count  as  LOA/Directive  Violation  if  a  frequency  change  is  issued  prior  to  completion  of  a  handoff  for  the 
appropriate  aircraft,  if  the  participant  changes  frequency  but  did  not  terminate  radar,  or  if  the  participant  flashed  the 
aircraft  too  early. 

Count  as  LOA/Directive  Violation  if  the  participant  failed  to  forward  a  military  change  of  destination  to  FSS. 

Count  as  LOA/Directive  Violation  if  the  participant  makes  a  handoff  to  and  switches  the  frequency  to  the  incorrect 
facility. 

Count  as  LOA/Directive  Violation  if  the  participant  drops  a  data  block  while  the  aircraft  is  still  inside  the  airspace. 

Count  as  LOA/Directive  Violation  if  the  participant  fails  to  inform  the  pilot  of  radar  contact. 

If  participant  has  an  LOA/Directive  Violation,  also  mark  as  Coordination  error.  If  you  mark  several  violations,  consider 
marking  down  Coordination  and  overall  categories. 

Failed  to  Accommodate  Pilot  Request 

Participants  shall  be  rated  as  failing  to  accommodate  a  pilot  request  if  the  controller  never  takes  appropriate  action  to 
accommodate  the  request,  if  the  controller  says  unable  when  he/she  could  have  accommodated  the  request,  or  if  the 
controller  says  stand  by  and  never  gets  back  to  the  pilot.  This  situation  applies  if  the  rater  determines  that  the 
controller  could  have  accommodated  the  request  without  interfering  with  other  activities.  Rater  must  balance  failing 
to  accommodate  pilot  requests  or  other  delays  against  factors  involved  in  Managing  Sector  Workload. 

If  another  facility  calls  for  a  clearance  and  the  participant  unnecessarily  fails  to  issue  it,  count  as  Delay,  not  as 
Failure  to  Accommodate  Pilot  Request. 

Unnecessary  Delay 

An  unnecessary  delay  is  considered  to  occur  if  a  pilot  request  can  be  accommodated  and  the  controller  delays  in 
doing  so,  if  the  participant  levels  any  departure  at  an  altitude  below  the  requested  altitude  and  there  was  no  traffic,  or 
if  an  aircraft  previously  in  holding  due  to  approaches  or  departures  at  MIO  and  MLC  airports  is  not  expeditiously 
cleared  for  approach. 

If  the  participant  leaves  an  aircraft  high  on  the  localizer  it  is  considered  a  delay  if  the  pilot/computer  says  unable.  If 
the  pilot/computer  does  not  say  unable  but  the  participant  could  have  descended  the  aircraft  sooner,  count  down  on 
category  C  (Maintaining  Attention  and  Situation  Awareness). 

If  another  facility  calls  for  a  clearance  and  the  participant  unnecessarily  fails  to  issue  it,  count  as  Delay,  not  as 
Failure  to  Accommodate  Pilot  Request. 

Incorrect  Information  in  Computer 

If  an  aircraft  does  not  have  Mode  C,  the  participant  shall  enter  the  reported  altitude  1)  when  the  pilot  reports  it, 
2)  prior  to  handoff,  or  3)  by  the  end  of  the  scenario.  If  this  does  not  happen,  count  as  Incorrect  Information  in 
Computer.  Also,  see  OE. 

Incorrect  Information  in  Data  Block 

Altitude  information  in  data  blocks  shall  be  considered  incorrect  if  and  when  reported  altitude  differs  by  1000  feet  or 
more  from  assigned  altitude  displayed  in  same  data  block. 
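
The data block rule reduces to a single comparison, sketched below with a hypothetical function name and altitudes expressed in feet.

```python
def data_block_altitude_incorrect(reported_ft: float, assigned_ft: float) -> bool:
    """True when reported and assigned altitudes differ by 1000 feet or more."""
    return abs(reported_ft - assigned_ft) >= 1000
```

The absolute difference is used because the rule does not distinguish an aircraft reported above its assigned altitude from one reported below it.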

OTS  Rating  Form 

Coordinating 

In  the  event  any  information  needs  to  be  passed  to  a  supervisor,  the  AT-SAT  Rater  shall  be  considered  acting  as 
same  supervisor.  Coordination  of  climbing  aircraft  shall  NOT  be  required  as  long  as  the  aircraft’s  data  block/flight 
plan  correctly  displays  the  aircraft’s  assigned  altitude. 

If  participant  doesn’t  enter  computer  information  (for  example,  change  in  route),  enters  incomplete  information,  or 
enters  information  in  the  computer  for  the  wrong  aircraft,  rate  them  down  under  OTS  Category  E  (Coordination). 
Don’t  mark  the  Behavioral  Checklist  or  use  the  Remaining  Actions  form.  This  is  not  to  be  rated  as  an  OD. 

If  participant  didn’t  coordinate  a  WAFDOF  for  aircraft  within  2.5  miles  of  sector  boundary,  it  counts  as  a 
coordination  error  (Category  E  on  OTS).  If  scenario  freezes  before  coordination  occurred  but  there  was  still  time  to 
accomplish  coordination  within  2.5  miles  of  sector  boundary,  doesn’t  count  against  Coordinating  category  (E)  on  the 
OTS.  Instead  count  as  Required  Coordination  on  Remaining  Actions  form. 

For  specific  circumstances  when  pilots  aren’t  going  to  meet  crossing  restrictions,  if  that  is  not  coordinated,  it  will  be 
counted  as  an  LOA  violation  and  coordination  error. 

If  participant  has  an  LOA/Directive  Violation,  also  mark  as  coordination  error.  If  you  mark  several  violations,  consider 
marking  down  Coordination  and  overall  categories. 

Managing  Sector  Workload 

If  participant  doesn’t  meet  TMU  in-trail  restriction,  count  under  G  (Managing  Sector  Workload). 

Differences  Between  Rater  Pairs  by  Ratee  and  Scenario  for  Each  OTS  Dimension  and  the  BEC 

APPENDIX  I 


Sample  Cover  Letter 
and 

Table  to  Assess  the  Completeness  of  Data  Transmissions 

Human Resources Research Organization

66 Canal Center Plaza, Suite 400 * Alexandria, VA 22314-1591

27  June  1997 


Dear  AT-SAT  Test  Site  Manager, 

With testing winding down, this is a good opportunity to compare the data that we
have currently processed from your site with the information contained in your records.
To this end, please find enclosed two tables.

The first table lists the transmission numbers and the date of those transmissions
that we have already processed from your site. As you might imagine, there is a lag in
the time between the date that a packet of data is transmitted to us and the date that we
process it and add it to our data base. Therefore, please do not be concerned if a
transmission that you have already sent us is not listed on this table; even an early
transmission may be missing from this list if we are still waiting for some additional
information from you or if we only recently processed it. This first table is enclosed to
help you work through the second table, which lists the participants you have tested by
the tests and forms that they have completed.

Specifically, an asterisk on this second table indicates the presence of predictor
and CBPM tests for participants from the transmissions listed in the first table, as well as
presence of SSN Request and Bio Data Forms. A blank cell on this second table indicates
that we do not have the corresponding tests/forms from the transmissions listed in the
first table. Because of the data processing time lag, you may see some SSN Request and
Bio Data Forms from transmissions that are not listed on the first table.

Please  review  both  tables  carefully.  For  each  blank  cell  on  the  second  table, 
please  provide  the  date  that  you  transmitted  the  information  to  us  in  the  cell  itself.  In 
addition  to  missing  information,  please  review  participant  identification  numbers  for 
accuracy  and  completeness.  If  required,  please  provide  additional  explanations  on  a 
separate  sheet  of  paper.  Please  fax  all  this  information  to  me  at  (703)  706-5623  by  July 
7,  1997.  Your  time,  hard  work,  and  diligence  are  greatly  appreciated. 


Sincerely,


Ani S. DiFazio


Senior  Scientist 
AT-SAT  Data  Base  Manager 

Friday, June 27, 1997

THE  EXAMINEE  ID  NUMBERS  AND  TESTS/FORMS  LISTED  IN  THE  TABLE  ON  THE  NEXT  PAGE 
WERE  CONTAINED  IN  THE  FOLLOWING  TRANSMISSIONS  RECEIVED  FROM  YOUR  SITE 


SITE=08  NAME=Los Angeles

TRANSMISSION   TRANSMISSION
NUMBER         DATE*

     1         05/12/97
     2         05/13/97
     3         05/14/97
     4         05/15/97
     5         05/16/97
     6         05/17/97
     7         05/19/97
     8         05/20/97
     9         05/21/97
    10         05/22/97
    11         05/23/97
    12         05/24/97
    13         05/26/97
    14         05/31/97
    15         06/02/97

*  If  a  range  of  dates  was  listed  on  the  Data  Transmittal  Form,  the  first  in  the  range  is  listed  here 

Table I-1. Table of Examinees and Tests/Forms Received for Data Processing, Site=08


ID NUMBER: 08003, 08004, 08006, 08011, 08016, 08017, 08019, 08021, 08022, 08023,
08024, 08025, 08027, 08028, 08029, 08030, 08031, 08033, 08034, 08035,
08036, 08038, 08039, 08040, 08042, 08043, 08046, 08050, 08054, 08055,
08056, 08058, 08060, 08063, 08065, 08066, 08069, 08070, 08072, 08073

Test/form columns: ST, DI, SN, LA, AM, SC, AN, AY, ME, AT, EQ, MR, TW, PL, CBPM, SSN, BIO

[Table body: one row per examinee ID, with an asterisk or a blank under each test/form column. The row-by-row alignment of asterisks and blanks is not recoverable from this copy.]


An  *  indicates  the  presence  of  a  test  or  form  for  that  examinee  in  one  of  the 
transmissions  listed  in  the  previous  table.  A  blank  indicates  that  no  test 
or  form  was  received. 
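
A site manager's review of the blank cells could be sketched as follows. This is a minimal illustration only: the column codes come from the table header, but the function name and the examinee records in the usage example are hypothetical and are not data from the table.

```python
from typing import Dict, List, Set

# Test/form column codes as they appear in the table header.
FORM_COLUMNS = ["ST", "DI", "SN", "LA", "AM", "SC", "AN", "AY", "ME",
                "AT", "EQ", "MR", "TW", "PL", "CBPM", "SSN", "BIO"]


def blank_cells(received: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, for each examinee, the columns with no test or form received."""
    missing: Dict[str, List[str]] = {}
    for examinee_id, forms in received.items():
        gaps = [column for column in FORM_COLUMNS if column not in forms]
        if gaps:
            missing[examinee_id] = gaps
    return missing


# Hypothetical records, for illustration only.
example = {
    "X001": set(FORM_COLUMNS),
    "X002": set(FORM_COLUMNS) - {"PL", "BIO"},
}
print(blank_cells(example))
```

Each entry in the result corresponds to a blank cell for which the letter asks the site to supply the original transmission date.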